FOREWORD


The field of tests as a subject of review is less clearly defined than are the
fields which deal with content as distinguished from method. One may 
approach the subject in a variety of ways. One may, for example, make a 
descriptive summary of the various types of tests. Again, one may discuss 
the technical problems and review the technical studies dealing with the 
construction and use of tests. Still another method of dealing with the 
subject is to discuss the practical ways in which tests may be used in dealing 
with problems of administration and teaching. If the last approach is fol- 
lowed, there is some duplication between the review of tests as they are used 
in the study of problems of curriculum, classification, methods, and the 
like, and the review of the study of these problems themselves. 

The authors of the present number have chosen to place their emphasis 
chiefly on the third mode of approach, the review of the uses of tests. In 
addition they have reviewed the chief technical problems, particularly those 
which have an obvious practical bearing. They have made no attempt to 
review or evaluate the tests which are on the market, partly because of their 
overwhelming number, and partly because of the absence of adequate evi- 
dence on which to evaluate them. 

This treatment of tests must be based partly on opinion, and involves, in 
some cases, differences of opinion. For example, the view expressed in the 
first chapter that the discriminative value of an item is to be measured by 
the correlation between the responses on that item and the responses on the 
test as a whole, assumes that the test measures a homogeneous, unified 
ability. Not all test makers agree with this assumption, as is shown in the 
second chapter. The reader will doubtless be interested to discover other 
differences in points of view. Such differences are inevitable where the 
discussion is based in part on opinion; and in this problem of the applica- 
tion of tests, on which there are now many questionings, it is inevitable 
that opinion shall be drawn upon. When the opinions are expressed by 
persons of such authority as those who have prepared this number they are 
bound to receive earnest consideration. 

FRANK N. FREEMAN, 
Chairman of the Editorial Board. 








INTRODUCTION 


This number of the Review of Educational Research, which has finally
been completed only after many prayers and much anguish, departs some- 
what from the style of the preceding numbers in that more space is given 
to the expressions of opinions of the authors of the various chapters and 
to the presentations of somewhat controversial issues. After considerable 
preliminary study of the task set for us during which was noted the vast 
literature on the subject including previous summaries and reviews, the 
committee finally decided that the most promising utilization of the space 
available would consist in directing discussions to recent developments and
to critical issues as well as to the general review of the field. As chairman, I 
have deliberately stimulated the discussion of controversial issues in addi- 
tion to more technical phases of testing and test construction. For it must 
be obvious that it is only through bringing out and constantly re-defining 
controversial issues that progress may be hoped for. It has seemed fitting 
also that essential technical implications of the measurement movement 
should be here treated in-so-far as space permits. 

The reader will note considerable querulousness and some overlapping 
of topics, as the different writers belabor their various problems. Thus, for 
example, while Chapter V is concerned essentially with uses of tests, the 
other chapters also treat in some measure this same topic in the develop- 
ment of the various arguments, since the general theme is “Educational 
Tests and Their Uses.” 

As chairman of the committee I wish to express my appreciation of the 
excellent cooperation of all the members, the patience of Dr. Freeman, 
Chairman, Editorial Board, in allowing us extra time, and to acknowledge 
my indebtedness to my colleagues here in our Bureau of Research: Dr. 
Angela M. Broening, Dr. Grace A. Kramer, and Dr. Harold B. Chapman,
without whose help this undertaking would have been impossible.


John L. Stenquist, Chairman,
Committee on Educational Tests and Their Uses. 
CHAPTER I


Basic Considerations 


The most outstanding development in the field of educational tests during
the last few years has not been in the technical advances in test construc- 
tion, important as these advances have been, but rather in the major 
strategy as opposed to the minor tactics of the use of educational measure- 
ments. 


Need of Measuring Growth over Period of Years 


Limitations of the snapshot theory of testing—The customary conception 
of tests as being essentially snapshot affairs to be given, scored, and acted 
upon at particular moments and then forgotten, is now thoroughly dis- 
credited in all informed circles. Indeed, the snapshot theory of tests and 
the correlative methods of constructing and using them are difficult to 
account for as elements of the objective testing movement. It is only by an 
appeal to history that they can be made to fit into the picture at all. When 
viewed from a historical viewpoint, it is readily seen that they are survivals 
from the traditional “system” of subjective examinations that prevailed 
almost unchallenged until the emergence of the objective testing movement 
under the leadership of Thorndike, Terman, and their followers, and that 
had for its major purpose the “passing” or “flunking” of students, and the 
“enforcement” of exceedingly vague and variable “standards.” The per- 
sistence of the snapshot theory and the correlative sporadic and unsys- 
tematic use of objective tests during the last decade, is evidenced by the 
fact that, with two or three outstanding exceptions, most of the so-called 
“standardized” tests available up to the year 1932 have existed in only two 
“equivalent” forms, which in some instances, at least, have turned out to 
be only “approximately” equivalent. We have had several series of so-called 
intelligence tests and several series of achievement tests in each of several 
matters for ten years or more: but even yet it may be confidently asserted 
that, with minor exceptions, no two such series have been made comparable, 
even though they have been edited by the same editor and published by the 
same publishing house. When viewed in the light of the insistence of the 
educational philosophers and psychologists on the importance of studies of 
growth, and of the insistence of the technicians on the importance of com- 
parability in educational measurements, the situation just alluded to shows 
up the test-makers in an unsuspected light—a light tinged with a curious 
sort of conservatism and unconscious respect for tradition. 

Systematic use of comparable tests—Fortunately, as indicated above, this
conservatism has rapidly given way during the last few years to a clearer
understanding of the basic necessity of measuring growth over a considerable
number of years, and to a new and more adequate appreciation of the
inescapable need for a systematic use of tests that yield comparable results. 
When considered in the light of these new conceptions, the prophetic and 
pioneering character of the twenty-odd comparable forms of the Thorndike 
Intelligence Examination for High-School Graduates, which were con- 
structed more than a dozen years ago, and of the five comparable forms of 
the New Stanford Achievement Test, which became available in 1928, is 
clearly revealed. 

One effort in the direction of providing comparable tests at the secondary 
and college levels has been made by the Cooperative Test Service (2, 3, 
5, 7, 8). Practical programs in the systematic use of comparable tests and 
cumulative records (4, 9) have been developed in several states, including 
Minnesota, Wisconsin, Iowa, Ohio, Kentucky, Pennsylvania (1, 6) and
others. 


Improved Methods of Using Tests for Guidance 


The impetus to this new conception of the rôle of tests has derived very
largely from the hard school of trial and error. There are many schools 
that began the use of tests with enthusiasm and abandoned them in
disillusionment. The collapse of tests in these schools is due to the
disorganized piece-meal manner of using the test results, and to the lack of
confidence inspired by “standardized” tests that are not only incapable of 
yielding comparable measures year after year, but are often very inferior 
in other respects. While there has been much room for technical improve- 
ment, the chief defect in the testing movement has been the neglect of 
building an adequate philosophy and system of using test results for effec- 
tive and constructive educational guidance in the larger sense of that word. 
Not even the best tests have been well used. The test leaders have been slow 
to escape from the worst faults of the subjective system of examinations 
whose technical weaknesses they have so exhaustively analyzed and exposed. 
The greatest weakness in the traditional college entrance examinations, for 
example, is not in their technical defects, but in the snapshot, unsystematic, 
spasmodic way in which they have been used; in the systematic misuse 
which has ignored the real values and potentialities of the subjective 
examinations. It is not unexpected that teachers swamped by heterogeneous 
hordes of children, and that harassed principals in high schools, and
admission officers and deans in colleges, should be dominated by immediate
expediencies to the point of failing to see the forest for the trees; but until 
recently too many leaders in testing and personnel work generally seemed 
to be dominated by the theory that test results and other personnel data 
were useful only in helping to solve the immediate exigencies of grading 
for credit purposes, of admission, of classification, of grouping for instruc- 
tional purposes for a quarter-semester, or at most a year-period, with 
variable and exceedingly opportunistic compromises with the curriculum.
The idea that such data should be systematically recorded in comparable 
and meaningful terms, and that they should be made the subject of con- 
tinuous and prayerful study and long-time planning on the part of the 
highest educational officers, as well as of the teachers throughout the whole 
educational ladder, has been slow to emerge, partly because the sporadic 
use of objective tests and snapshot interviews has given visibly better results 
and has thus served as a valuable stop-gap, and partly because of the per- 
sistent failure of all concerned to appreciate the magnitude of the guidance 
problem. 

That the true nature and dimensions of the problem of guiding growing 
citizens through ten to twenty years of schooling into life in the wide world 
has been glimpsed only recently, is amply indicated by the fact that only 
a few years ago college personnel leaders apparently conceived of their 
tasks as being entirely independent of personnel and guidance work in the 
lower schools. The literature of a half-dozen years ago gives the impression 
that many college leaders were confident that high-pressure work with 
students after admission would solve the “college” personnel problem. 
Most teacher-training courses on personnel and guidance work are still 
announced as “Guidance in the Junior High School,” “Senior High-School 
Guidance,” and “College Personnel Technics.” 

As long as these arbitrary divisions of the educational ladder and the 
passage from one to another were exaggerated into an unnatural impor- 
tance, and as long as the problems supposed to inhere particularly in the 
passage from high school to college were considered solvable by single 
sets of “Cross-the-Rubicon” examinations, it is understandable that the 
objective testing movement should have been first considered and judged 
primarily on the basis of its contributions to “college admissions” and 
other similar momentary crises up and down the educational ladder. That 
objective tests have acquitted themselves well when so used is amply indi- 
cated by the experience of a dozen years or more with intelligence, scholastic 
aptitude, and subjectmatter achievement tests. Indeed, it was the early 
success and patent superiority of objective over subjective tests when used 
for such purposes that prolonged the false hope that such momentary make- 
shifts as college entrance examinations and placement tests would suffice 
to solve the guidance problem, and thus delayed the concept of the rôle of
comparable tests in the guidance problem, which concept now happily 
animates both testmakers and the increasing number of teachers and ad- 
ministrators throughout the educational ladder who have been active in the 
guidance movement. 

According to this conception, the highest purpose and ultimate aim of 
the objective testing movement is not to make better college entrance or 
course-credit examinations, but to help inaugurate a continuous study of 
individuals throughout the whole educational ladder by means of system- 
atically recorded comparable measures and observations which will make 
such spasmodic examinations largely unnecessary. From this viewpoint,
college admission is merely one aspect of the larger and vastly more im- 
portant total guidance problem. College admission will become an orderly 
and constructive process rather than the single nervous act which it now 
is—a part of the progressive learning and guiding of individuals into types 
of activities and ambitions which best suit their capacities, interests, and 
needs, from the kindergarten through the university. 

Since the dawn of this view of the dimensions of the guidance problem 
and of the rôle of comparable tests in meeting the problem, many college
leaders are beginning to ask questions which are a clear challenge to the 
leaders in the testing movement. “Why is it,” they ask, “that after more 
than a decade of objective tests, the high schools are still unable to furnish
the colleges with accurate and comparable indices of the achievements and 
abilities of candidates for admission to college?”* Many enlightened sec- 
ondary-school leaders are asking the same question regarding pupils that 
come to them from elementary schools in cities that have full fledged test- 
ing divisions, reference and research offices, and the like. 

Few cities that have had test divisions for a decade can supply to the 
colleges cumulative records of comparable measures on their high-school 
graduates. The same is true of elementary schools regarding graduates to 
their own high schools. Plans for testing in such cities are highly variable, 
opportunistic, and, with rare exceptions, are not primarily designed to 
throw light on individual pupils as growing entities. The choice of tests 
is made without regard to the comparability of their results with those of 
tests used previously or to be used later, and no effective effort has yet been 
reported for making them comparable ex post facto. The reports are still 
made up largely of sterile comparisons of arbitrary or accidental groups 
of pupils called “classes” or “schools,” and of “promotion” statistics which 
are nearly, if not quite, meaningless so far as the human engineering prob- 
lem of educational guidance is concerned. The obvious implications of the 
presence in “eighth-grade” classes of pupils of “fourth year” achievement, 
and vice versa, are ignored almost as completely in the 1930 as in the 1920 
reports of city “research” bureaus. The new conception of the rôle of
comparable tests in the study of growing children has emerged none too soon
to save the testing movement from obloquy and the schools from the wrath 
of the taxpayers. 


Long-time Guidance of Individuals through Tests 


The highest rôle of measurement in education is not in the minor tactics
of the classroom, but in the major strategy of educational guidance: in the
prophecy of long-term provisional goals for individual pupils, and the
progressive modification of those goals in accordance with cumulative evi- 
dences of growth and of needs, intellectual, personal, and social. The use- 
fulness of tests in diagnostic and remedial work is well established and 
universally recognized; but in-so-far as this use of tests runs counter to,
or does not contribute to, the long-time guidance of pupils, it is not to be 
accepted as an unmixed blessing. Some observers have feared, not without
cause, that the diagnostic and remedial applications of tests are in some
instances inspired (perhaps unconsciously) by the fallacious “leveling”
implication of the democratic ideal of popular education, and thus may
represent a subornation of test results to the enforcement of preconceived
curriculums and “standards” which, in some instances at least, are wholly alien
to the capacities and needs of the individual child who is the supposed
beneficiary of remedial treatment.

The first question that the school should ask and answer at least pro- 
visionally several times each year is, “What can Johnny learn, and which of 
the things he can learn should the school, in the light of all the facts, try 
to help him learn?” Many of those who accept the fact of individual dif- 
ferences fear that this question, upon which systematically used comparable 
tests can throw so much light, is too often answered by reference to the 
curriculum book and to the administrator’s convenience, and that the exi- 
gencies of maintaining the correctness of this pre-destined answer are too 
often the cue for diagnostic testing and “remedial teaching.” 

Tests should first of all tell what a pupil should try to learn—not how 
he may be cajoled, persuaded, or insidiously coerced into learning item x in 
the “standard” curriculum for grade n. If a pupil has difficulty in learning 
item x, this fact in itself may be evidence that x is not suited to his capacity 
and needs, and that he should, therefore, be given opportunity to learn 
something else, not forced by “remedial” treatment to live up to the “high 
and ever higher standards” so often found in perorations that are more 
impressive than meaningful. 

Even if some do learn the prescribed minimum under the pressure of 
“remedial” treatment, the results might not be worth the effort. Indeed, if 
we consider the attitudes of despair, the feelings of inferiority, the habits of 
dependence, the frequently temporary and superficial, if not fictitious, 
character of forced learning, and the loss of opportunity and time for 
learning something that is within the comprehension and interest of the 
pupil, it is not by any means certain that the efforts to “remedy” children 
up to prescribed minimum are not positively harmful. It is no reflection 
on the curriculum makers to say that the curriculum is not sacred, and that 
the “prescribed minimum” may be a golden calf for some children. 

Pupils who cannot learn after a year or two of trial might be excused 
even from some of the supposed “essentials” of the elementary curriculum, 
and encouraged to study some vocational subject or subjects that lie within 
their capacities and interests and thus have meaning to them. In many cases 
this procedure would not lessen their achievements in the elementary cur- 
riculum, and might engender better habits and more responsible attitudes 
in place of the disappointments that now frequently result. 











The statistical “success” of certain “methods” involving “testing and 
remedial technics” has sometimes earned credit of a very dubious sort for 
the testing movement, and has frequently obscured and ignored or begged 
the fundamental guidance problem which has been emphasized in preced- 
ing paragraphs. The fact that certain testing and remedial technics have 
raised the average of ten thousand children to the extent of .5 sigma above 
the average of a control group may be entirely adequate evidence to prove
the “superiority” of such technics over the methods used in the control 
groups for raising the average of such groups in the function or functions 
measured by the test; but it offers no evidence on the prior and more funda- 
mental question as to which, if any, of the children involved should or 
should not attempt to master or get a smattering of the subject connoted 
by the tests used. 

Let it be assumed, for example, that in a certain large area a certain 
change in tests, examination technics, or methods has indubitably raised 
the average of fifty thousand ninth graders in algebra (almost any other 
subject would serve in this example) to the extent of .5 sigma. Such gains 
have, in some instances at least, been interpreted to be analogous to raising 
the basal wage-rate of fifty thousand employees from, let us say, five dollars 
to five dollars and fifty cents per day. The inference is that algebra (plus 
the device or devices in question) is not only better, but better for all the
pupils involved by some amount proportional to .5 sigma. The fallacy of this
inference involves a great deal more than a statistical fallacy.

Indeed, for the purposes of this discussion, the statistical fallacy may be
ignored; and attention be confined to the more important facts in the case,
which are the fundamental assumptions and prejudices which underlie and
motivate the fal- [...]
any degree is good for all who now take the course, and that any degree of
increased achievement in algebra is good for any and all such pupils, then 
the upward shift of the distribution might be regarded as a clear gain. But if 
all or part of this assumption is rejected, then the unmixed value of the gain 
is at least questionable. If the more reasonable assumption is made that
algebra is not necessary for many, is (all things considered) undesirable or
futile for some, and is positively harmful to a few, the raising of the whole
distribution may, to some extent, represent a misapplication of educational
funds, a waste of children’s time, and a reduction of their opportunities for 
genuine gains in other subjects and activities. The fundamental questions 
here are: (1) Which children are ever going to use algebra, directly or in- 
directly, sufficiently to give them an adequate return on their investment? 
(2) At what point on the achievement scale does a subject become worthy of 
its cost in a utilitarian or cultural or disciplinary sense? For the purposes of 
this discussion let it be assumed that the A’s who have been raised to A+
have enjoyed a real gain, although it is not impossible that some of them 
might have gained something of more permanent value to their lives and to 
society by learning or doing something else in or outside the present curric- 
ulum. The same assumption might be made with regard to the B’s or even 
the C’s; but there are few unprejudiced persons who are familiar with the 
attitudes and achievements of the D and F groups in the average school who 
would seriously maintain that raising F’s to D status is worth the time and 
effort involved. Indeed, there are many observers who would maintain that 
the raising of F’s to D status represents a loss, on the theory that the traveller 
who goes farther on the wrong road is less fortunate than the traveller who 
has gone astray only a little way. 
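The arithmetic behind such a shift can be sketched briefly. The sketch below assumes, purely for illustration, that scores are normally distributed and that a fixed cutting score one sigma below the original mean separates the F group from the rest; neither assumption comes from the data discussed here, and the function name is hypothetical.

```python
from statistics import NormalDist


def fraction_below(cut_z: float, shift_sigma: float = 0.0) -> float:
    """Fraction of a group scoring below a fixed cutting score.

    Scores are assumed normal with unit sigma. `shift_sigma` is the
    upward shift of the whole distribution, in sigma units.
    """
    return NormalDist(mu=shift_sigma, sigma=1.0).cdf(cut_z)


# Hypothetical cutting score: one sigma below the original mean.
before = fraction_below(-1.0)        # about 16 percent fall below the cut
after = fraction_below(-1.0, 0.5)    # about 7 percent after a .5 sigma shift

print(f"below the cut before the shift: {before:.1%}")
print(f"below the cut after the shift:  {after:.1%}")
```

Under these assumptions the shift moves roughly nine pupils in a hundred across the cutting score. The computation is silent on whether crossing it was worth the time of the pupils concerned, which is the question raised above.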

The above observations lead to the proposition that it is the highest func- 
tion of the testing movement to help remove such fundamental questions 
from the realm of assumptions, which are complicated by academic log- 
rolling and pietistic prejudices, and to put the matter on a plane of rational 
consideration. The prime requisite for answering this question for indi- 
vidual children is to learn more than is now known about the capacities 
and interests of individual children. Ten years of research have fairly
established the fact that while the occasional use of tests may mitigate mal- 
adjustments, such use of tests has signally failed to solve the problem of 
individual adjustment to the satisfaction of the schools or of society. Hence 
the plea herein is made for the systematic use of comparable tests and the 
careful study of their cumulative indications along with other relevant data 
for the long-time guidance of individuals and the comprehensive planning 
of their whole educational careers. 

It seems clear that whatever contribution the systematic use of compar- 
able tests may make to the problem of individual pupil adjustment will 
also tend to clarify, if not solve, some of the other problems that now beset
our schools. In the hypothetical case considered above, it seems obvious 
that limiting algebra instruction to the smaller number of pupils who can 
really learn and use or enjoy it, would have two or three effects of
fundamental importance in addition to that of raising the level of achievement.
In the first place, it would free a large number of children from efforts 
that are foredoomed to failure and allow them the possibility of devoting 
their time and energy to activities that are appropriate to their capacities, 
interests, and needs. The importance of this aspect of the matter has usually 
been overlooked by special pleaders for the curricular status quo. The 
usual form of their argument is that even those who fail are getting some- 
thing out of it. Logically this argument is an appeal to the doctrine of 
despair; and unless its fallaciousness and harmful results are clearly ap- 
prehended, it will contribute to the continuance of the present wholesale 
sacrifice of children on the altar of erroneous prescriptions. 

In the second place, the limitation of algebra instruction to the pupils 
described above would tend to raise the level of teaching ability, because 
fewer teachers would be required for the smaller number of children. The 
admittedly large number of persons who now teach algebra as an assign- 
ment, who do not understand its uses or appreciate it as a method of 
thought, and who have no vital interest in it, would be released for school 
tasks more appropriate to their capacities and tastes. 

In the third place, this policy would reduce the cost of education and in- 
crease its constructive output. This does not mean that educational budgets 
would be reduced, but rather that they would be redistributed and more 
effectively used; and would thus open up the possibility of making the 
now hard-pressed and dissatisfied taxpayer willing to contribute larger ap- 
propriations. 


Illustrative Use of Cumulative Records of Comparable 
Measurements 


The importance of cumulative records of comparable measurements may 
be illustrated by the five-year records of two thirteen-year-old boys who 
were brought to the writer’s attention in May, 1929, when they were both 
in the sixth grade. The principal had just administered the Stanford 
Achievement Test, Form A, and found that John had a total score at the 
70th percentile, and Frank, at the 40th percentile of thirteen-year-old boys. 
The principal concluded that the difference was not great enough to justify 
drastic action; and in view of the unreliability of even the best tests and the 
almost total absence of any other meaningful information about the two 
boys, his decision to allow them to go “through the mill” seemed wise. Then 
it was accidentally discovered that both boys had taken Form B in May, 
1928, and Form A in April, 1927, and two intelligence tests in September, 
1926, and January, 1927. When the results of all these tests were graphed on 
the American Council on Education Cumulative Record Form, it was appar- 
ent that the two boys were in entirely different intellectual levels, their near- 
est approach to each other having been in their age-percentile ranks in May, 
1929. In all the other tests John had consistently been at or above the 90th
percentile, and Frank, at or below the 30th percentile. With this array of 
evidence, the principal felt justified, in spite of the fact that they had
received average teachers’ marks of B and C, in making radical changes in their
curriculums, and still more drastic changes in the provisional plans for 
their later educational careers. The tests given to these boys in 1930 and 
1931 confirmed the indications of the earlier tests, with which they were 
comparable. 

Importance of comparability—The importance of using tests that yield 
comparable measurements over a series of years is well illustrated by the 
two sample cases just described. Although the two boys were approxi- 
mately two standard deviations apart, they nevertheless received average 
teachers’ marks of B and C respectively. In a few cases the English grade 
of the less competent of the two was superior to the English grade of the 
more capable of the two boys. 
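The distance between the two boys can be checked by converting the percentile ranks reported above into sigma units. This is a minimal sketch that assumes the scores of thirteen-year-old boys are normally distributed, an assumption not stated in the records themselves; the function name is hypothetical.

```python
from statistics import NormalDist


def sigma_gap(pct_high: float, pct_low: float) -> float:
    """Distance in sigma units between two percentile ranks,
    assuming a normal distribution of scores."""
    unit = NormalDist()
    return unit.inv_cdf(pct_high / 100) - unit.inv_cdf(pct_low / 100)


# May, 1929, taken alone: John at the 70th percentile, Frank at the 40th.
snapshot_gap = sigma_gap(70, 40)

# All the other tests: John at or above the 90th, Frank at or below the 30th.
cumulative_gap = sigma_gap(90, 30)

print(f"single test of May, 1929: {snapshot_gap:.2f} sigma")
print(f"remaining cumulative record: at least {cumulative_gap:.2f} sigma")
```

On this assumption the single test of May, 1929 puts the boys about 0.8 sigma apart, while the rest of the record puts them at least 1.8 sigma apart. Since John stood at or above the 90th percentile and Frank at or below the 30th, the figure of approximately two standard deviations is consistent with the record.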

It has already been pointed out that the fundamental weakness in the 
objective testing movement to date is that only two or three of the vast 
number of series of tests that have been available during the last decade 
have been made to yield comparable results. It is this major weakness in the 
history of test construction that shows up the existence of the snapshot 
theory of educational tests and the lack of an adequate appreciation of the 
systematic use of the tests in the study of growth in individuals. It is grati- 
fying to be able to record that leaders of several of the better organized 
state testing programs are now definitely addressing themselves to the prob- 
lem of constructing comparable examinations and using them systemati- 
cally. As noted above, it was an appreciation of the needs for comparable 
tests that specifically led to the organization of the Cooperative Test Service 
of the American Council on Education. 

The fallacies of standards—The fundamental weakness of standards has 
been their vagueness or meaninglessness. As currently used, the word 
standard has no place in educational literature outside the perorations of 
convention orators. The absurdity of the connotation in which it is used in 
such speeches is clearly exposed when the statements concerning educational 
standards are transferred to the more tangible rubrics like height and 
weight. The typical exhortation to teachers to seek “high and ever higher 
standards” is about as meaningful as would be the exhortations from doc- 
tors that children should be grown taller and ever taller. In this discussion 
it is not necessary to advert to the logical and psychological fallacies that 
are implied by such use of the word standard. It is enough to point out its 
meaninglessness. 

Speaking more constructively, it is sufficient to point out that educa- 
tional standards are necessarily individual, and in their fundamental nature 
are akin to the standards of tailors and shoemakers who judge the quality 
of their products by how well they fit the individual for whom they are
intended and who pays for them, and how long they serve him.
Critical Issues in the Construction of a General Achievement Test


The need for authoritative and specific descriptions of subjectmatter—In 
order to construct a valid and acceptable general achievement test, the test 
constructor must have before him an authoritative and specific description 
of the subjectmatter to be tested. Descriptions which have either one, but 
not both, of these characteristics are almost equally valueless. An achieve- 
ment test is a collection of specific items dealing with specific materials. 
No single test item can test directly and by itself for the attainment of an 
ultimate objective. Ultimate objectives only represent convenient ways of 
describing immediate and more specific objectives collectively, and are 
tested only indirectly by testing for the attainment of these specific objec- 
tives. The materials used in testing for the attainment of immediate ob- 
jectives are identical with, or highly similar to, the materials used in teach- 
ing to attain those objectives. Someone, either the test builder or the 
curriculum maker, must break down the ultimate objectives into immediate 
and specific objectives, and must prepare specific instructional materials 
for them, before construction of test items becomes at all possible. It is of 
little value for the test builder to know, for instance, that the ultimate 
objective of the social studies is “to develop better citizens, and well- 
rounded and cultured individuals,” and that the intermediate objective is 
“to develop in the pupil a reasoned understanding of and insight into the 
nature, function, and evolution of . . . . institutions and practices and 
problems,” unless, or until, he is told what specific items of related infor- 
mation, what specific relationships, and what specific ideas, generalizations, 
and broad concepts, constitute the accepted subjectmatter of instruction. 
Neither is it of much value to him to receive such suggestions for the content 
of test items, unless he has some reason to believe that this content will be 
generally accepted as belonging to the subjectmatter involved, that is, unless 
it is described authoritatively. The only authoritative sources of such spe- 
cific descriptions of content which have been made available thus far to the 
test constructor are the textbooks and courses of study now being used in 
instruction. It is inevitable, therefore, that present achievement tests will 
follow closely the content of the best of present textbooks and courses of 
study, not because this content is that which ought (presumably) to be 
taught, but rather because it is the only content which thus far has been 
described in a form sufficiently specific, meaningful, and authoritative to 
make test construction possible. 

Selection of the content of an achievement test—There are two conflicting 
considerations involved in the selection of the content basis for achievement 
test construction. In the first place, it is desirable that such tests avoid 
the appearance of encouraging the continuance of the status quo in curriculum 
content, thereby seemingly constituting a hindrance to progress and 
improvement in curriculum building. From this point of view, it would 
appear that tests intended for wide-spread use should be based on the most 
advanced and approved content available. In the second place, it is desirable 
that such tests provide as accurate measures as possible of the extent 
to which the pupil has achieved what he has been encouraged and given a 
reasonable opportunity to achieve. From this point of view, it seems that 
standardized achievement tests should be based on what is now being taught 
and learned. A compromise between these two conclusions is, of course, 
desirable, but it is important that both viewpoints be given adequate 
consideration in arriving at that compromise. 

¹ For a fuller treatment of these issues, see: Lindquist, E. F., and Anderson, H. R. 
The Nature and Function of a General Achievement Test in the Social Studies. Iowa 
City, Ia.: the Authors (State University of Iowa). 

From the viewpoint of the teacher and the pupil it would appear de- 
sirable to construct standardized general achievement tests, so that they are 
as purely as possible measures of the success with which pupils have learned 
that which they have been encouraged and given a reasonable opportunity 
to learn, and so that they are as little as possible measures of the con- 
formity of local courses of study to the latest suggestions for curriculum 
revision and improvement. While this condition could be approached by 
making the content basis of the test that of the typical local course of study, 
or by limiting it to that content which is common to the various local 
courses of study, such practice might appear to encourage or to sanction a 
static condition in the curriculum, and therefore is not to be considered. 
There is, however, a better alternative. That alternative is to base the test 
on that subjectmatter which is common to the best of the textbooks and 
courses of study now in use, and to place the major emphasis in the test on 
reasoned understanding of that content rather than upon the factual content 
itself. 

Validity of test item—Any single achievement test obviously cannot hold 
the pupil directly responsible for an understanding of, or for the ability to 
use, or even for a verbal learning of, all of the specific information, rela- 
tionships, ideas, generalizations, etc., which constitute a given field of sub- 
jectmatter. The subjectmatter of United States history, for example, con- 
sists of thousands of items of information, and thousands of related ideas, 
generalizations, inferences, and implications, based on that information, 
which it might be considered desirable for the pupil to learn and under- 
stand. If each of these elements could be given a weight proportional to its 
importance or value, then the pupil’s total achievement or “general achieve- 
ment” would be measured by the weighted sum of such elements as he has 
learned and understood. This concept of a true measure of general achieve- 
ment, of course, can be only hypothetical. No single test which will measure 
each of these many elements directly and individually in separate test items 
can be constructed and administered. 
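
The weighted sum described above can be written out as a brief sketch. The weights, the four sample elements, and the function name `weighted_achievement` are illustrative assumptions only; the text itself treats such a measure as hypothetical, since no actual test could cover every element of a field.

```python
def weighted_achievement(weights, mastered):
    """Sum the weights of the elements a pupil has learned and understood.

    weights  -- importance assigned to each element (hypothetical values)
    mastered -- True where the pupil has learned the corresponding element
    """
    if len(weights) != len(mastered):
        raise ValueError("weights and mastered must have the same length")
    return sum(w for w, m in zip(weights, mastered) if m)


# Four imaginary elements of subjectmatter with assumed weights.
weights = [3.0, 1.0, 2.0, 0.5]
mastered = [True, False, True, True]
total = weighted_achievement(weights, mastered)
print(total)
```

A real test can only sample a small fraction of these elements, which is why the discussion that follows turns to the selection of items.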

The items constituting any given general achievement test must be con- 
sidered as representing only a very restricted sampling selected from all of 
the items that might be constructed on the basis of the subjectmatter in- 
volved. Few schools will use a general achievement test that requires more 
than two hours for its administration, and tests of even shorter length are 
most in demand. Such tests cannot well include more than one hundred to 
two hundred items. With so restricted a sampling, it is highly important 
that each element in this sampling, or each item in the test, contribute as 
much as possible to the validity of the whole test. The validity of the 
whole test depends upon the degree to which it ranks pupils in the order of 
their true total achievement. It clearly follows that the validity of any 
single item in the test also must depend (within limits) upon the degree to 
which that item of itself discriminates between pupils of inferior and su- 
perior total achievement. If the ability of the pupils to respond correctly 
to a given item shows no relationship to their general achievement, that is, 
if the pupils who succeed on the item are not superior in general achieve- 
ment to those who fail on the item, then that item cannot contribute to the 
central purpose of a general achievement test and cannot be defended for 
inclusion in the test, regardless of the “validity” of the item for inclusion 
in the course of study and regardless of its difficulty. 

It often happens that two objective test items may prove equally “diffi- 
cult” and may hold the pupils responsible for equally valid content from 
the curriculum viewpoint, and yet the actual responses made to one may 
be much more highly related to general achievement than those made to 
the other. Certain items, apart from their difficulty or desirability, repre- 
sent far more crucial tests or indicators of general achievement than others. 
For example, in the field of elementary-school spelling, it has been shown 
that the ability to spell the word “adequately” is very highly related to 
general spelling ability, while the word “advisers” is misspelled more fre- 
quently by superior than by inferior spellers; yet both words are misspelled 
by the same proportion of pupils in a random sample of eighth graders, 
and both words appear equally desirable for inclusion in the course of 
study in spelling. 

The worth or effectiveness of a test item depends, therefore, not only 
upon its desirability for inclusion in the curriculum and upon its “diffi- 
culty,” but also upon its power to discriminate between pupils of high and 
low levels of general achievement. It is important to recognize this double 
aspect of the validity of a test item, and to see clearly the relation between 
“validity” from the curriculum viewpoint and “validity” for achievement 
testing purposes as determined by the discriminating power of an item. 

The discriminating power of a single test item—The discriminating power 
of a single test item refers to the degree to which success or failure on that 
item by itself indicates possession of the general ability which is being 
measured. In relation to tests of the general achievement type, it may be 
defined as the accuracy with which a pupil can be placed along the general 
achievement scale on the basis of success or failure on the given item. An 
item may be said to be perfect in discriminating power when every pupil 
who responds correctly to the item ranks higher on the general achieve- 
ment scale than any pupil who fails on the item. An item may be said to 
have zero discriminating power when there is no systematic difference be- 
tween the general achievement of the pupils who succeed on the item and 
those who fail. 

Otherwise stated, an item is said to discriminate if the pupils who respond 
correctly to that item are, on the average, superior in general achievement 
to those who respond incorrectly. If the pupils who succeed on a given 
item are, on the average, just equal in general achievement to those who 
fail, then the item has no discriminating power. The degree of discriminat- 
ing power of an item therefore depends upon the magnitude of the differ- 
ence between the averages in general achievement of those who succeed and 
those who fail on the item. 
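
The definition just given can be illustrated with a minimal sketch: the discriminating power of an item is taken here as the difference between the mean criterion scores of the pupils who pass and of those who fail. The function name `mean_difference` and the six sample scores are assumptions made for illustration, not data from the authors' try-outs.

```python
from statistics import mean


def mean_difference(criterion_scores, item_correct):
    """Mean criterion score of pupils who pass the item minus that of pupils who fail."""
    passed = [s for s, c in zip(criterion_scores, item_correct) if c]
    failed = [s for s, c in zip(criterion_scores, item_correct) if not c]
    if not passed or not failed:
        raise ValueError("the item must be passed by some pupils and failed by others")
    return mean(passed) - mean(failed)


# Hypothetical general achievement scores for six pupils.
scores = [10, 20, 30, 40, 50, 60]

good_item = [0, 0, 0, 1, 1, 1]      # passed only by the three highest pupils
flat_item = [1, 0, 0, 0, 0, 1]      # passed by the lowest and the highest pupil
reversed_item = [1, 1, 1, 0, 0, 0]  # passed only by the three lowest pupils

print(mean_difference(scores, good_item))      # positive: the item discriminates
print(mean_difference(scores, flat_item))      # zero: no discriminating power
print(mean_difference(scores, reversed_item))  # negative discriminating power
```

All three imaginary items are of the same difficulty (three of the six pupils pass the first and third, two pass the second), yet they differ completely in discriminating power.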

Various hypothetical degrees of discriminating power for a test item of 
50 percent difficulty are represented in Figure 1. This figure shows the 
various types of relationships which may be found between general achieve- 
ment, as measured by a comprehensive criterion test, and the ability to re- 
spond correctly to a single given item (in this case an item answered cor- 
rectly by 50 percent of the pupils in an experimental group). The vertical 
scale in this figure indicates the percent of pupils who responded correctly 
to the item. The placement of pupils along the (horizontal) general achieve- 
ment scale is determined on the basis of their percentile standing on the 
criterion test. The “line of discrimination” for a given item indicates the 
percentage of pupils at each level of general achievement who responded 
correctly to the item. 
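
A "line of discrimination" of the kind plotted in Figure 1 can be tabulated as follows. This is a sketch under stated assumptions: pupils are ranked on the criterion test, divided into equal-sized bands of achievement, and the percent responding correctly is computed for each band. The function name, the number of bands, and the sample data are hypothetical.

```python
def line_of_discrimination(criterion_scores, item_correct, bands=5):
    """Percent of pupils answering the item correctly in each achievement band.

    Pupils are ordered by criterion score and split into `bands` groups of
    (nearly) equal size, from lowest to highest general achievement.
    """
    order = sorted(range(len(criterion_scores)), key=lambda i: criterion_scores[i])
    n = len(order)
    line = []
    for b in range(bands):
        members = order[b * n // bands:(b + 1) * n // bands]
        if members:  # skip empty bands when there are fewer pupils than bands
            correct = sum(item_correct[i] for i in members)
            line.append(100.0 * correct / len(members))
    return line


scores = list(range(1, 11))  # ten pupils with hypothetical criterion scores 1..10

# An item like line MM: failed by everyone below the median, passed by everyone above.
perfect = [0] * 5 + [1] * 5
print(line_of_discrimination(scores, perfect, bands=2))

# An item like line UU: the same percent correct at every level of achievement.
flat = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(line_of_discrimination(scores, flat, bands=5))
```

Both imaginary items are of 50 percent difficulty; only the shape of the tabulated line distinguishes them.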

Line MM in Figure 1 represents the line of discrimination for an item 
(of 50 percent difficulty) which shows perfect discriminating power, since 
every pupil below the 50th percentile of general achievement missed the 
item, and every pupil above the 50th percentile succeeded on it. The pupil 
who responds correctly to this item may then be accurately placed on the 
general achievement scale with reference to one point, in this case the 
point of median achievement. Only a dichotomous classification, of course, 
is possible—the pupil’s distance above or below the median point in general 
achievement cannot be determined on the basis of this item alone. 

Line UU in Figure 1 represents the line of discrimination for an item 
(of 50 percent difficulty) which has zero discriminating power, since the 
same percent of pupils at every achievement level responded correctly to 
the item. When a pupil responds correctly to an item of this type, there 
is no greater reason for placing him on the lower part of the general 
achievement scale than the upper; that is, he is no more likely to be good 
than poor in general achievement. Items of this type have no functional 
value in a general achievement test, regardless of their other characteristics. 

Line VV in Figure 1 represents the line of discrimination for an item 
(of 50 percent difficulty) which has negative discriminating power, since 
it is answered correctly more frequently by pupils of inferior general 
achievement than by pupils of superior achievement. A pupil who responds 
correctly to an item of this type is more likely to be low in general achieve- 
ment than one who responds incorrectly. Such items can have no functional 
value in a general achievement test, unless the pupils are “given credit” for 
making the wrong response, which obviously is impracticable. A large 
number of items of this type has been discovered by the authors in their 
experimental try-outs of test materials in the social studies, and concrete 
illustrations of them will be presented later. 

Between the extremes of perfect positive and negative discriminating 
power all degrees of discrimination may be found. These are illustrated 
(for items of 50 percent difficulty only) in Figure 1 by lines NN, OO, PP, 

It is very important to note that there is no apparent reason for assuming 
any relationship between the discriminating power of a test item and 
its “difficulty,” or percentage of incorrect responses. For example, while 
only ten out of one hundred pupils may succeed on a given test item, it 
may happen that these ten pupils are on the average no higher in general 
achievement than the ninety who fail on the item in question. Similarly, 
an item may be answered correctly by eighty out of one hundred pupils 
(that is, it may be an “easy” item), and yet among the twenty pupils who 
failed on this specific item there may be many who are superior in general 
achievement to some of those who succeeded on it. The difference in aver- 
age achievement between those pupils who fail and those who succeed may 
be much greater for one item than for another of the same difficulty, or 
for a given “easy” item than for another that is very difficult. An item of 
any difficulty may have any degree of discriminating power. 

In Figure 2, hypothetical lines of discrimination are illustrated for good 
and poor items at each of three levels of difficulty. 

Line N'N' in Figure 2 represents an item of 80 percent correct responses 
(20 percent difficulty) with very high discriminating power, and line V'V' 
represents an item of the same difficulty but with very low discriminating 
power. The lines N''N'' and V''V'' represent items of 50 percent difficulty 
with high and low discriminating power respectively. Lines N'N', N''N'', and 
N'''N''' represent items of marked differences in difficulty but with the 
same degrees of discriminating power. Pupils may not always be assumed 
to be high in general achievement simply because they succeed on a “difficult” 
item, or low in achievement simply because they fail on an “easy” 
item. There are many “difficult” items which are more frequently answered 
correctly by inferior than by superior pupils, and many “easy” items which 
are more frequently missed by good than by poor pupils. 

In terms of this illustration, it should be apparent that the ideal test 
would consist of items of high discriminating power distributed evenly 
over the difficulty scale. In other words, the ideal test would contain some 
easy items that discriminated sharply at low levels of achievement, others 
of medium difficulty that discriminated sharply at medium levels, and still 
others of high difficulty that discriminated sharply at high levels of achieve- 
ment. The significance of this ideal distribution of item difficulty will be 
discussed later. Statistical technics have been developed for expressing the 
discriminating power of a test item in terms of a single numerical index, or 
“index of discrimination.” 
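
The text does not say which single index the authors have in mind. One common choice, assumed here purely for illustration, is the point-biserial correlation between success on the item and the score on the criterion test; this agrees with the Foreword's remark that discriminative value may be measured by the correlation between responses on an item and on the test as a whole. The function name and sample data are hypothetical.

```python
from statistics import mean, pstdev


def point_biserial(criterion_scores, item_correct):
    """Point-biserial correlation between item success (0/1) and criterion score."""
    passed = [s for s, c in zip(criterion_scores, item_correct) if c]
    failed = [s for s, c in zip(criterion_scores, item_correct) if not c]
    sd = pstdev(criterion_scores)  # population standard deviation of all scores
    if not passed or not failed or sd == 0:
        raise ValueError("index is undefined without variation in both item and score")
    p = len(passed) / len(criterion_scores)  # proportion answering correctly
    return (mean(passed) - mean(failed)) / sd * (p * (1 - p)) ** 0.5


scores = [1, 2, 3, 4]  # hypothetical criterion scores
print(point_biserial(scores, [0, 0, 1, 1]))  # strongly positive
print(point_biserial(scores, [1, 0, 0, 1]))  # zero
print(point_biserial(scores, [1, 1, 0, 0]))  # strongly negative
```

The index is positive when those who pass the item average higher on the criterion than those who fail, zero when the two averages are equal, and negative when the relation is reversed.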

Summary of restrictions upon the content of a general achievement test— 
In summary, then, the specific content of a general achievement test in- 
tended for a given group of pupils is subject to the following definite re- 
strictions: 


(1) If we think of the entire group of pupils as separated into a number of levels 
of general achievement, then for each of these levels the test must contain an approxi- 
mately equal number of items calling for information that the pupils at that level have 
learned; or for ideas and generalizations that they do understand; or for judgments, 
applications, or reasoning of which pupils at that level are capable. 

(2) The items thus selected with reference to each level of achievement must 
discriminate as sharply as possible between pupils above and below that level; that is, 
it must (ideally) be highly probable that all pupils above that level will succeed on 
each of those items and that all pupils below that level will fail on each of them. 
(This is particularly important in recognition tests with reference to items that call 
for judgments beyond the ability of the pupils tested.) 

(3) The items must be such that the response scored as the “correct” or “best” re- 
sponse would be considered so by competent authorities. (If the wrong response of 
certain items could be scored as “correct,” these items would meet the first two require- 
ments.) 


(4) The items must hold the pupil responsible for understandings, abilities, or in- 
formation that it is believed will contribute to the realization of the objectives of 
instruction; that is, they must be based upon subjectmatter which has been authorita- 
tively and specifically selected and described for purposes of instruction and which does 
belong to the field of subjectmatter involved. 

These discussions of the technical aspects of test construction have shown 
that the problem of defining a field of subjectmatter in which general 
achievement is to be measured, and the problem of selecting elements of 
that subjectmatter for the construction of items in a general achievement 
test, present very marked and significant differences. The items in a test do 
not by any means represent a random or representative sample of the con- 
tent of the field in question. While every attempt should be made to make 
the sampling as representative as possible, certain items must be excluded 
because of technical considerations. What the test may contain is very 
largely a function of what has been learned and of what is the level and 
range of achievement in the group to be tested. The content of achievement 
tests, therefore, cannot be expected to parallel the course of study exactly, 
and such examinations can neither be used to “check” nor be checked by 
course of study outlines directly. The problem of selecting test items cannot 
be left to the subjectmatter expert or to the subjective judgment of anyone 
but is mainly a technical problem, and must be based upon objective facts 
secured from actual trials of large numbers of items with pupils of the kind 
to which the completed test is to be administered. 

CHAPTER II 
The Selection of Test Items¹ 


The problem of the selection of test items has existed since the first test 
was given. In the days before the advent of standard tests the basis of 
selection was entirely analytical. The testers simply included those items 
or questions that, within the limits of the subjectmatter, seemed to them 
as of most importance. This was a matter of individual preference and of 
subjective judgment. The trend of the times was to allow much freedom to 
the individual examiner and everyone seems to have been happy. 

Conditions of the sort that has just been described existed until the 
advent of the standardized test. The appearance of the standardized test 
marked the beginning of a new point of view concerning the selection of 
test items. Since then this selection has been based more or less on statistical 
considerations. Among the various workers the relative emphasis given to 
the analytical and statistical methods of approach has varied. Some work- 
ers have favored the analytical approach almost exclusively, some have 
based their work mainly upon statistical considerations, while most have 
made some more or less balanced combination of the two. At all events a 
line of cleavage between analysis and statistics has existed and persisted 
since the advent of the first standardized test. 


Analytical vs. Statistical Determination of Test Items 

The appearance of the Stone Arithmetic Tests in 1908 (56) marked the 
beginning of the standardized test in its more modern sense. Stone used 
both analysis and statistics in the selection of his test items. He analyzed 
or divided his items into classes. Some of them related to computation and 
others to problems. The computation items represented each of the four 
fundamental processes. The problems were selected so as to afford situa- 
tions equally concrete to all children in the A sixth grade. Large numbers, 
particular necessary requirements, catch problems, and all subjectmatter 
except whole numbers, common fractions, and United States money were 
excluded. 

The statistical nature of Stone’s work lay in the fact that he selected 
items of known difficulty and arranged them in scaled order. From this 
simple and apparently trivial beginning has come a great statistical move- 
ment in education—one which has threatened, at times, to crowd out 
entirely the analytical approach to the selection of test items. 

In 1909, the year after the publication of the Stone Tests, Thorndike 
published his handwriting scale. The items in this scale were selected 
entirely by statistical procedure. Apparently the entire object was to secure 
samples of handwriting of such a nature that their values in terms of 
quality would be approximately one P. E. apart after they were arranged 
in scale order. Analysis seems to have played almost no part at all. 

¹ Bibliography prepared with the assistance of Dorothy Adkins. 

The Thorndike Handwriting Scale is the pattern for all of the scales 
which have appeared since. The method used in the derivation of the scale 
has been applied in other subjects and in the rating of human beings. In 
all these subsequent applications statistical considerations have been para- 
mount, but the analytical approach has also been in evidence. In English 
composition the first scale was published by Hillegas. Like the handwriting 
scale it was statistical in nature. Some of the paragraphs were frankly arti- 
ficial while others were selected from the works of standard authors. Appar- 
ently any source was utilized to obtain a composition of the desired quality. 
The disadvantages of this procedure were evident and soon a supplement 
of the Hillegas Scale appeared (67). Trabue made a substantial improve- 
ment upon the Hillegas Scale when he restricted most of his compositions 
to the subject What I Should Like to Do Next Saturday but catered to 
statistical considerations by adding some compositions of a miscellaneous 
nature at the top of his scale. 

Trabue’s supplement was followed later by Thorndike’s extension of 
the Hillegas Scale (63), in which there is definite evidence of analysis in 
the selection of the compositions. The Thorndike extension contains twenty- 
nine compositions, but only ten deal with the mere supplying of details. 
Three are concerned with cause and effect relations, three with advantages 
and disadvantages, five with interpretation, four with functional relations, 
one with inference, and one with argument. The remaining two compositions 
were letters. Since eight types of paragraphs were included, there was no 
homogeneity. To make matters even worse pupils were asked to write 
stories which had to be limited to narration only, and these stories were 
graded on the basis of a scale which contained mostly non-narrative mate- 
rial. Obviously statistical considerations were influential. 

Ballou (12) saw the defects of this procedure. He criticised the Hillegas 
Scale on the ground that it “aims to measure too varied a product,” and 
contains compositions that are “artificial, bookish, and not typical of good 
school work.” He objected also to the fact that no conversation was in- 
cluded. Ballou said, “A scale should not measure too complex a product. 
To attempt to measure the several forms or types of English composition 
by one and the same scale is like trying to measure heat, light, and color 
by the same instrument. . . . The compositions of a scale must be analyzed 
as to merits and defects, also there is not only no guarantee but little possi- 
bility that the users of the scale will interpret the qualities of compositions 
any better than they now interpret the qualities of compositions without 
the use of any objective scale.” The Harvard-Newton Scales were con- 
structed to correct those faults. They contain separate scales for description, 
exposition, argumentation, and narration. With each composition there is a 
list of merits and defects and a comparison of that composition with others 
in the scale. The Harvard-Newton Scales are much better than those which 
preceded them. They have merited much more consideration than they 
have received. 

Unfortunately Ballou failed to break through the statistical mode of 
the time. Seven years later when composition scales began to appear again, 
statistical considerations were still dominant. The Hudelson and Willing 
Scales show little evidence of analysis, but Lewis (33) analyzed his items 
into letters containing orders, letters of application, social letters, and 
simple narration. During the eleven years since Lewis published his scales, 
only one contribution of analytical nature seems to have appeared in 
English composition. That is the work of Odell (42). Odell’s Scales 
were designed to rate pupils’ answers to thought questions. There are 
nine scales covering analysis, cause-effect, comparison, criticism, discus- 
sion, explanation, relationship, argumentation, and summary. The com- 
positions for each scale are motivated by an appropriate stimulus ques- 
tion, and each scale includes lists of merits and defects. 

While Odell apparently had no intention of doing so, it is true never- 
theless, that he has constructed a composition scale which, in the estimation 
of the reviewer at least, is the best that has yet appeared. 

The work of Buckingham (16:13) raised another troublesome question. 
In the construction of his Spelling Scale, Buckingham selected words that 
“were sufficiently common in the speaking vocabulary of third-grade chil- 
dren,” and whose spelling difficulty in most cases was “great enough to 
test the ability of eighth-grade children.” In this selection he had regard 
for both the analytical and statistical points of view. The new problem 
centers, however, in his definition of difficulty. He assumed that the proper 
difficulty of words for each grade is that in which the word is spelled cor- 
rectly by half of the pupils and missed by the other half. This conclusion 
was arrived at on the basis of considerations that are purely statistical in 
nature. Buckingham argued that words which all of the children of a given 
grade can spell and words which none of them can spell are alike useless 
for purposes of measurement. From these two theses he arrived at the con- 
clusion that the midpoint, the one in which half of the student population 
answers correctly and half answers incorrectly, is the best crucial score 
for each grade. 
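
Buckingham's reasoning can be illustrated with a later and simpler formalization, which is an assumption of this sketch and not part of his own argument: if a word is scored 1 when spelled correctly and 0 otherwise, the variance of the scores among pupils is p(1 - p), where p is the proportion spelling it correctly. The function name `item_variance` is hypothetical.

```python
def item_variance(p):
    """Variance of a right/wrong (1/0) score when a proportion p of pupils succeed."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    return p * (1.0 - p)


# Proportions correct from 0 percent to 100 percent in steps of 10 percent.
grid = [i / 10 for i in range(11)]
for p in grid:
    print(f"{p:.1f}  {item_variance(p):.2f}")

best = max(grid, key=item_variance)
print(best)
```

Words spelled by all pupils or by none yield no variance and so separate no pupil from any other, while the variance is greatest at the midpoint. This supports the 50-50 plan only where the purpose is to spread pupils apart, which is exactly the point disputed in the paragraphs that follow.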

In his criticism of the Buckingham Scale, Otis (44:795) said: “The 
most crucial words for any grade are believed to be those which may be 
spelled by approximately 50 percent of the pupils of that grade, or those 
words of approximately equal difficulty of which the average score of the 
pupils of that grade will be 50 percent.” Otis also made a distinction be- 
tween testing the ability of a pupil to spell words with which the child has 
had ample time to become familiar and the probable number of words in 
the language which the pupil can spell. If the former is desired, Otis 
endorsed the plan of Ayres (11) who placed the crucial score at 84 or at 
words which 84 percent of the children spell and 16 percent miss. On this 
point Ayres said (11:37): 

All of the scales have been arbitrarily cut off at 50 percent, partly because it is doubtful 
whether any useful teaching process is served by testing children on words of 
which they cannot spell more than 50 percent correctly, and partly because children 
of the lower grades attempting to spell difficult words frequently fail, not because of 
the inherent difficulty of the spelling, but because the word form is not yet definitely 
a part of the children’s regular vocabulary. 


Courtis (20, 21, 22) agreed with Ayres to the extent that he called the 
crucial grade from 76 to 80. Monroe placed this same grade at 70 percent 
and Haggerty, at 66 percent. 

The issue here raised is fundamental. Shall pupils be measured against 
a sort of absolute social standard or in terms of what they have been taught? 
The statistician inclines to the former point of view and the teacher and 
school administrator, to the latter. The school administrator can advance 
arguments against Buckingham’s position. As a pre-test it might be advan- 
tageous to know that none of the children are able to score even a single 
point on the performance of a task that is to be undertaken. In like manner 
there may be vast satisfaction to the teacher and all concerned when all of 
the pupils get perfect scores on some test. Surely no one would feel angry 
or disheartened when a group of pupils is able to score 100 percent in their 
responses to the fundamental combinations in arithmetic. 

The source of the entire difficulty, as Monroe and Ayres indicated, lies in 
the failure to consider tests in terms of the purpose with which they are used. 
The statisticians have been interested in testing for purposes of classifica- 
tion. Under such conditions the 50-50 plan is desirable and defensible. In 
the measurement of achievement, however, the case is quite different. No 
one is satisfied with the 50-50 plan. Passing grades are seldom, if ever, 
set that low. 

The distinction between testing for classification and for achievement 
has not been clearly seen by many workers in the field. Ashbaugh (10) 
used the 50-50 plan and the same procedure was followed by Cook (18), 
Gates (27), Kelley (29), Ruch (50), Symonds (57), Thorndike (60) and 
T. Thurstone (64). All of these workers were interested mainly in improv- 
ing the statistical method as applied to testing. To them, testing is a measur- 
ing and not a teaching device. As McCall (37) put it, they are “willing 
to sacrifice diagnostic ability to statistical beauty.” Wood (77) said: “In 
the construction and administration of examinations measurement must 
not be confused with pedagogy.” Apparently the test maker is to be justified 
in going outside of the school curriculum for a 50-50 test item. The fact 
that the desired item is one that the child has not had an opportunity to 
learn in school is in no wise a deterrent to the statistically minded test 
maker. His search for 50-50 items must not be limited by the content or 
objectives of a course of study. 

Against this point of view Monroe (39) and Wilson (75) protested. 
Wilson’s protest was answered by Kelley (29). Portions of Kelley’s defense 
are worth quoting. He said: “The merit of a yardstick used in measuring 
the height of children is not judged by its effect upon the heights of the 
children measured. That a yardstick does not affect the height of a child is 
not usually held against it.” 

Kelley admitted that “to best test the specific advances of pupils that 
have been subjected to a particular instruction in spelling, it is necessary 
that the test involve the exact words used in the spelling lessons”; but 
added, “The authors of the Stanford Achievement Tests believed that their 
function was different from this; and that they were called upon not only 
to test pupils upon specific words studied, but to test for a spelling ability 
circumscribed only by the field of common usage.” 

Kelley pointed out that school grades are not discrete. Some pupils in 
every grade have abilities that are from one to two grades higher while 
others really belong on levels one or two grades lower than the grade in 
which they are placed. Consequently if children in the eighth grade are to 
be tested adequately the words (items) in the test must vary in difficulty 
from the sixth- to the tenth-grade levels. Since 50-50 words are desired, the 
test must include 50-50 words for the sixth, seventh, ninth, and tenth grades 
as well as for the eighth grade. The 50-50 words for the ninth and tenth 
grades must therefore, of necessity, be those that have not been taught. 
The Stanford Achievement Tests are, therefore, primarily an instrument for 
the classification of pupils. The measurement of achievement is a by-product. 

The authors (31) of these tests distinguished between the measurement 
of the achievement of educational objectives and classification, but they 
combined both functions in the same tests and named their tests in terms 
of their minor function. This tends rather to perpetuate the confusion 
which exists among the rank and file upon this point. 


Difficulty of Test Items 


There is yet another open question of long standing with reference to 
the selection of test items. Previous to the appearance of the Stone Tests (56) 
it was generally assumed that test items are of equal difficulty throughout 
a given test. The essay questions of previous tests were usually presented in 
groups of ten or less, and the same credit was allowed for each question 
correctly answered. Stone, however, arranged his items in order of diffi- 
culty. Some of them were more difficult than others. The scale makers 
followed this procedure also. Courtis (21, 22), however, adhered to the 
older policy. The items in the Courtis Tests are all of approximately equal 
difficulty. Woody (78) followed in the steps of Stone and got into diffi- 
culty because he could not always find an item which was one unit more 
difficult than the preceding one. In such a situation he preferred to bring 
in a non-essential item of the proper difficulty even when it was necessary 
to omit samplings of essential points in arithmetic. Monroe (39) criticised 
this procedure and asserted that test items should be restricted to the list 
obtained from the analysis of subjectmatter. Monroe also attacked the 
policy of scaling items as a whole. Such scaling presupposes a gradual 
increase of ability on the part of the pupil; whereas the teaching of a given 
item should result in an abrupt increase. This is in line with the general 
assumption that pupils know nothing at all about given items before they 
are taught, but have virtually perfect knowledge of them after teaching. 

This question has never been settled and as a result we still have both 
kinds of tests. The advent of the so-called “new form” examinations, such 
as the true-false, multiple-choice, and completion types, has perpetuated 
the policy of constructing tests out of items of approximately equal diffi- 
culty. At the same time there is a continual crop of standard tests and scales 
in which the items have increasing difficulty. 


Validation of Test Items 


The question of the difficulty of test items merits further consideration, 
but it is necessary to turn aside for a time to consider one more new ele- 
ment that the statistical method has brought with it. In 1914 Thorndike 
(61) published his first report on the measurement of ability in reading. 
In the vocabulary portion of the test Thorndike selected as test items 
names which belong in certain well-known classifications, and controlled 
the depth of comprehension by limiting definitions to the ability of the 
pupil to associate each species with its appropriate genus. This procedure 
constituted a “criterion” for the mastery of word meanings. The war 
brought the word criterion into more general use. It also tended to put 
test construction on a rather empirical basis. The test maker looked first 
for a list of items that would yield a “normal distribution” and would 
correlate as highly as possible with some criterion. A test constructed in 
this manner was considered “valid” without regard to the nature of the 
test items. The exigencies of war made it impossible to measure the activ- 
ities of men on the actual battle front. Therefore an “indirect measure” 
was sought. Whenever it was impossible to measure a function directly, 
recourse was had to the measurement of another function that is highly 
correlated with the first. This “indirect measurement” was necessary then 
and may still be quite useful, but it has its disadvantages. It suffers in the 
first place because it has proved impossible to find functions that correlate 
anywhere near perfectly with any criterion. Secondly, such procedure de- 
stroys all consideration of individual test items. This sort of measurement 
reduces analysis almost to the vanishing point. It represents statistical 
method at its maximum. 

After the war the influence of this procedure upon achievement testing 
was marked. Validation became almost entirely a matter of correlation 
with a criterion. The test item was completely lost for six years until Vin- 
cent (73) discovered it again. In the meantime far from adequate attention 
was given to the selection of criteria. Upon this point Ruch and Stoddard’s 
(51) excellent summary of methods of test validation threw some light. 
Of the twelve methods which Ruch and Stoddard mentioned five involve 
correlation with criteria. These criteria include judgments of competent 
persons, rating scales, school marks, previously validated measures, and 
tests of other intellectual, non-intellectual, or educational tests. All of 
these rest ultimately on subjective judgments. This is obviously unsatisfac- 
tory, but the reviewer has been able to find only a few studies in which 
anyone has tried to do anything about it. Toops and Royer (66) detailed 
a method for constructing a criterion. The treatment is highly technical, 
but well worth the study of all those who plan to use this method of test 
validation and of those who wish to improve methods of constructing 
criteria. Thorndike (62) based his criterion of “intellectualness” on cer- 
tain objectively verified assumptions concerning that function. He (62:156) 
reported that “on the whole it is certain that we cannot trust any consensus 
of present opinion to provide an accurate measure [criterion] of the diffi- 
culty or of the intellectual difficulty of a single brief task.” Gates (26) and 
Cook (18) were skeptical about criteria, but neither offered constructive 
suggestions concerning their improvement. 


The rediscovery of the test item dates from the publication of Vincent's 
study (73) in 1924. Since that time there has been a growing feeling that 
test items should be weighted, not only in terms of their difficulty, but also 
with reference to their individual contribution to the correlation of the test 
with its criterion. The contribution of the individual item has been esti- 
mated in a variety of ways. Lentz, Hirshstein, and Finch (32) gave the most 
complete summary and evaluation of these methods. Cook (18) also gave 
an excellent treatment of them. A few of the methods are worthy of special 
mention here. The bi-serial R method stands high in the opinion of all but 
it involves much labor. For this reason most of the workers prefer some 
modification of the overlapping method. Cook (18) found that one form 
of the overlapping method is really superior to the bi-serial R. Lentz, 
Hirshstein, and Finch (32) found the “upper and lower third” method 
best. Clark (17) reported a method of obtaining an “index of validity” for 
a single test item by means of the formula 

$$\text{I. V.} = \frac{P - D}{1 - D}.$$

In this formula I. V. is the index of validity, P the percentage of the 
criterion group failing to answer the item correctly, and D the percentage 
of the group who took the sub-test and failed to answer the same item 
correctly. An inspection of this formula shows that the right hand member 
reduces to zero when P = D and to infinity when D = 100 percent. But when 
P = D the index ought to be a maximum and it ought to be a minimum when 
D = 100 percent. It looks as though Clark’s formula should have been written 

$$\text{I. V.} = 1 - \frac{P - D}{1 - D}.$$
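
The reviewer's objection can be checked numerically. The sketch below is illustrative only: it assumes Clark's index has the form (P - D)/(1 - D), which is inferred from the description above (zero when P = D, unbounded as D approaches 100 percent), and it takes P and D as proportions rather than percents. The function names are hypothetical.

```python
def clark_index(p: float, d: float) -> float:
    """Index read as (P - D) / (1 - D); p and d are proportions failing the item.

    The form of the formula is an assumption inferred from the surrounding text.
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= d < 1.0):
        raise ValueError("p must lie in [0, 1] and d in [0, 1)")
    return (p - d) / (1.0 - d)


def revised_index(p: float, d: float) -> float:
    """The reviewer's suggested form, 1 - (P - D) / (1 - D)."""
    return 1.0 - clark_index(p, d)


if __name__ == "__main__":
    # Hold P fixed and let D climb toward 1 to watch the quotient grow in size.
    for p, d in [(0.30, 0.30), (0.30, 0.60), (0.30, 0.90), (0.30, 0.99)]:
        print(f"P={p:.2f} D={d:.2f} "
              f"Clark={clark_index(p, d):8.2f} revised={revised_index(p, d):8.2f}")
```

When P is smaller than D the quotient is negative, so the growth "to infinity" as D nears 100 percent is a growth in magnitude; the sign depends on which of P and D is larger.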

Lindquist and Anderson (34) obtained an “index of goodness” for 
single test items in terms of the ratio of errors made by the upper and lower 
quartiles. They published a test in world history which shows an “index 
of goodness” for each test item. Similar information is available also in the 
study of W. Wilson, Welsh, and Gulliksen (76). Toops (65, 66) pre- 
ferred to validate test items by means of the regression equation technic. 
His plan has the great advantage of preventing the repetition of test items 
which overlap in their contributions to the criterion. The disadvantage lies 
in the amount of statistical labor involved. More work of this sort would 
constitute a valuable contribution to our knowledge of the validity of 
single test items. 

All of the statistical methods of validating test items that have been 
mentioned thus far have been applicable only to test items requiring 
dichotomous answers. The answer must be either right or wrong. This is 
obviously a disadvantage, but fortunately McCall (36) described a method 
that does not suffer this limitation. The method is statistical and technical. 
It suffers somewhat from the inadequate explanation which McCall’s stu- 
dents gave concerning it. 

The foregoing discussion of test items indicates that considerable progress 
has been made in the selection of test items, but even yet there is doubt as 
to whether the procedure is really worthwhile. Corey (19) and Whelden 
and Davies (74) reported in favor of weighting test items, while Douglass 
and Spencer (24), Odell (41), and Potthoff and Barnett (48) found that 
weighting is not worthwhile. Tyler (71) criticised the practice of validat- 
ing individual items by their relationship to the total test score. He claimed 
that the items are not highly valid and that their relationship to the total 
test score assures mere homogeneity rather than validity. The doubt which 
arises concerning the use of a criterion in connection with the validation 
of test items has led some to pin their faith to reliability as the best means 
of validation. In this connection Symonds’ article (58) and Kelley’s note 
(30) are of interest. 
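
Tyler's criticism concerns the practice of correlating each item with the total test score. A minimal sketch of that practice follows, assuming right-wrong items scored 1 or 0 and an ordinary product-moment correlation; the function names and the data layout are illustrative and are not taken from any of the studies cited.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def item_total_correlations(responses):
    """responses: one row per pupil, each row a list of 0/1 item scores.

    Returns one correlation per item against the total test score.
    """
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [pearson([row[i] for row in responses], totals)
            for i in range(n_items)]
```

Because the total score contains the item itself, a high value here shows agreement of the item with the rest of the test. Whether that agreement amounts to validity or only to homogeneity is exactly the point in dispute.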

Having traced the development of statistical method, it is now time to 
give further attention to the analytical method of selecting test items. It 
has been pointed out that the original method was analytical and that at 
least some trace of the analytical method is present in nearly every statis- 
tically derived test. But the analytical method has never been entirely 
without its defenders. Previous to 1917 Ballou made a careful analysis of 
difficulties in common fractions, and found, for example, fourteen types 
of abilities involved in the addition of fractions, each of which called for a 
different specific ability. Monroe (39) in 1917 approved of this procedure 
and commended the Cleveland Survey Tests in Arithmetic because they 
were based upon a similar analysis of subjectmatter. In 1927 Monroe (38) 
wrote: “It may be interesting to ask students to respond to puzzles or other 
exercises which are not in agreement with accepted educational objectives, 
but it is seldom worthwhile and is to be condemned.” Monroe also favors 
the retention of the essay examination as an essential means of testing the 
achievement of important educational objectives. Previously (1923) Mon- 
roe and Souders (40) listed fifty types of essay examination questions 
which were “related to the daily objectives of the teacher.” Odell made 
further contribution to the analytical method of approach.* In 1922, 
Pressey (49) wrote: “A good test covers only the really important points 
of a subject. .. . The best tests are based on very careful research as to 
the fundamental objectives in the subject concerned, and the material is 
selected with reference to its importance for these objectives.” Wilson (75) 
said: “The first fundamental criterion of a test should be that it serves the 
main curricular aim of the subject tested. The second fundamental criterion 
is that the test should properly reenforce good methods of teaching.” 
Tyler (68, 69, 70, 71, 72) is a strong believer in the analytical method. 
He (68) criticised those who assumed that measures of information are 
adequate measures of ability to think. He reported that the correlation 
between information and certain types of thinking in college biology ranged 
from .40 to .46. The correlation between information and skill with the 
microscope was only .02. Tyler’s fundamental thesis is that objective 
tests should be concerned with obtaining objective evidence of the degree 
to which students are attaining the important goals of education. “In the 
past there has been the common failure to distinguish between the con- 
tent of a subject and the mental processes which a student of this subject 
is expected to exhibit.” Tyler insists that those who build examinations 
in college subjects must have “training in the analysis of the psychological 
processes characteristic of college subjects.” He opposes the use of a single 
test in a subject because each subject has more than one objective. A test 
is needed for each objective for which attainment is to be measured. “To 
ignore this is to put a straitjacket upon education.” 


1 Odell, Charles W. Traditional Examinations and New Type Tests. New York: Century 
Co., 1928. 469 p. 

This is a reflection of a growing view that tests may easily block educa- 
tional progress. It is claimed that teachers are inclined to teach test items; 
that they do not discriminate between the items that are related to educa- 
tional objectives and those that are present in the test for statistical reasons 
only. These fears are also responsible for the feeling on the part of some 
that all testing should be abolished. Possibly it was a fear of this sort that 
motivated Peters (46, 47) to study the relation of standardized tests to edu- 
cational objectives with special reference to social needs. Peters found 
that twenty-two different types of test validation had been used by the test 
makers of 183 tests. Less than one-third of these had used analytical meth- 
ods. Peters (47:150) concluded that “only a few tests rest upon a system- 
atic survey of social needs.” Other writers who indorse the analytical ap- 
proach include Barr (13), Seashore (53) and Sones and Harry (55). 


The Essay Type of Question 


For sixteen years it has been customary for test makers to decry the 
essay examination because of its unreliability. The result of this opposition 
has been almost to destroy the instrument whose chief function is to measure 
a pupil’s familiarity with the thought of our great thinkers. The unreliability 
of the essay examination is said to be the result of two things: subjectivity 
and limited sampling. Most of the workers in the test field have been pre- 
dominantly statistically minded and to them a lack of reliability is natu- 
rally a fatal defect. As a result, the essay examination has simply been ig- 
nored. No attempt has been made to question the condemnation that it has 
received, and until recently no one has tried to improve this type of test. 
Osburn (43), however, reported a study in defense of the essay examina- 
tion. He found that the subjectivity of scoring can be decreased markedly 
by providing a list of acceptable answers, care in the statement of ques- 
tions, and control of the amount to be taken off for formal errors. With 
these improvements Osburn obtained scores of high reliability. For ex- 
ample, the mean of the scores on “Name the five most famous Spanish ex- 
plorers and their explorations,” assigned by seventy-five relatively un- 
trained scorers was 34 times its semi-interquartile range. The corresponding 
critical ratio for “What are the functions of leaves?” was 5, and that for 
“Compare the Articles of Confederation with the Constitution” was more 
than 6. For twenty-nine questions in history and biology all but five scor- 
ings showed a critical ratio of more than 3. While these results are incon- 
clusive, they show that the subjectivity of the scores of essay examinations 
can be controlled to a very large extent. 
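
The critical ratio quoted here is the mean of the scorers' marks divided by the semi-interquartile range of those marks. A small sketch of that arithmetic follows; the choice of the "inclusive" quartile rule is an assumption, since the source does not say how Osburn located the quartiles, and the function name is illustrative.

```python
from statistics import mean, quantiles


def critical_ratio(scores):
    """Mean score divided by the semi-interquartile range Q = (Q3 - Q1) / 2.

    Quartiles use the 'inclusive' interpolation rule, which is an assumption.
    """
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    q = (q3 - q1) / 2
    if q == 0:
        raise ValueError("semi-interquartile range is zero; ratio undefined")
    return mean(scores) / q
```

A large ratio means that the marks assigned by different scorers cluster tightly around their mean, which is the sense in which Osburn speaks of reliable scoring.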

The criticism of the essay examination from the point of view of inade- 
quate sampling is usually the result of superficial consideration only. A 
number of writers have compared the five or ten questions of the essay 
examinations with the one hundred items of a true-false or other new-form 
examination and have concluded at once that the essay examination in- 
cludes very little sampling. Such writers fail to consider the elements in 
the answers to essay questions. These are the only items of an essay examina- 
tion that can be compared legitimately with the items in a “new-form” test. 
The average number of items in the answers to essay examinations runs 
around nine or ten per question. 

A few writers such as Eurich (25), Sims (54), Russell (52), Talbott and 
Ruch (59) have examined essay questions more closely. Both Eurich and 
Sims were interested in making essay examination questions covering the 
same ground as “new-form” tests. Sims used unimproved essay questions, 
and Eurich’s results are somewhat in doubt because the tests which he 
compared contain different numbers of items. He has sixty-nine items in 
his essay test, ninety-four in completion form, thirty-nine in multiple choice, 
and fifty in true-false. It is difficult to understand how a legitimate com- 
parison of types of examinations can be made when the number of items 
in each test varies so greatly. 

Russell (52) and Talbott and Ruch (59) were interested in comparing 
essay and “new-form” questions on the basis of intensive versus extensive 
sampling. Russell presented a diagram to show that a given pupil may 
get zero or 100 percent according to how the essay questions are selected. 
Talbott and Ruch concluded that “on the average the essay question calls 
forth less than half of the pupil’s knowledge.” All of this sounds conclusive, 
but Osburn (43) questioned the soundness of the extensive sampling pro- 
cedure. He pointed out that the theory of sampling, as applied in educa- 
tion, is borrowed from the field of mathematics. In mathematics this theory 
involves two assumptions: homogeneity and chance distribution of the 
content to be sampled. In education, Osburn pointed out, the content is 
neither homogeneous nor subject to the law of chance distribution except 
within such limited fields as may be covered by a single essay question. 
If these statements are true, serious doubt is cast upon all tests such as 
those of the “new-form” type that are based on extensive sampling, and the 
foundation of most statistical methods of validating test items will be 
dangerously beset. 

Finally the new interest in the essay examination has brought out some 
evidence that the old-time essay examination was not as bad as it seemed. 
Brinkley (15) found that “new type tests prepared by a group of high- 
school teachers gave, on the whole, lower validity coefficients than old 
type tests.” Even after instruction had been given in formulating new type 
tests the results as to validity were “slightly poorer.” Brinkley succeeded 
in making new type tests that showed slightly greater validity than the old 
type, but “the significance of the difference could not be determined.” On 
the whole Brinkley seems quite favorable to the analytical approach. 
Gates (28) used essay examination scores as an important element in his 
criterion for true-false tests. Bayles and Bedell (14) found that comple- 
tion in story form (mutilated essay examination answers) is more valid 
than modified true-false, matching, multiple choice, and ordinary true- 
false forms when the same unit of subjectmatter is concerned. 


Summary and Conclusions 


The foregoing survey of the literature relating to the selection of test 
items makes no pretense to completeness, but it will be a matter of regret if 
any of the major contributions have been overlooked. The aim has been to 
raise into the clear as many as possible of the crucial problems that are 
involved and present them as clearly and fairly as possible. It will be 
noticed that the conflict between the analytical and statistical point of view 
has been prominent throughout the history of the test movement. This con- 
flict is increasing rather than decreasing at the present time. It is impos- 
sible to see what the outcome will be, but in all probability some sort of 
compromise will result eventually. In the interests of more complete har- 
mony, it might be well to distinguish between testing and measuring, as 
has been the case in chemistry. In that subject, testing is a qualitative pro- 
cedure while measuring is quantitative. The former is concerned with the 
mere detection of the presence or absence of a given ingredient, while the 
latter is concerned with how much of the ingredient is present. Possibly 
the analysts will eventually be content to restrict themselves to testing as 
thus defined, leaving measurement to the statisticians. 

Other crucial questions relate to whether or not all test items should be 
of equal value, whether or not these items should be related to the cur- 
riculum as it now is, whether or not the extensive theory of sampling is 
sound in education, and how to improve the criteria of validation. 





CHAPTER III 
Recent Developments in Statistical Procedures 


Important developments in the applications of statistical methods to test 
construction date, with few exceptions, from about 1920. A few sources 
prior to 1920 should receive specific mention because of their great in- 
fluence on later developments. The publication in 1913 of Thorndike’s text 
in statistical method (231) first drew the attention of American educators 
to quantitative thinking. Kelley’s doctoral dissertation (147) in 1914 in- 
troduced the technic of partial correlation and multiple regression to test 
workers. The years 1916 and 1917 saw the publication of three influential 
books: those of Starch (221), Rugg (209), and Monroe, DeVoss, and Kelly 
(171). To these sources may well be added Whipple’s Manual (257) and 
Terman’s Measurement of Intelligence (229). Before 1920 several other 
texts on measurement were published, and several numbers of the Teachers 
College Contributions to Education and the Teachers College Record de- 
scribed the construction and scaling of standard tests and scales. Since 
1920 the most influential textual treatment of the present topic is probably 
Kelley’s Interpretation of Educational Measurements (148), if one may 
select from a score or more of important treatments of measurements. 
Hull’s Aptitude Testing (141) initiated another important movement. 

Since 1920 various writers, particularly Monroe (175), McCall (165), 
and Ruch and Stoddard (207), have described in detail the steps in the 
construction of standard tests, together with the appropriate statistical pro- 
cedures. 


Validation Procedures 


Criteria of validity—Monroe (175) discussed seven criteria for the vali- 
dation of test items. Ruch and Stoddard (207) and Symonds (226) pre- 
sented more extended statements of validation methods. Wood (260) pre- 
sented evidence on the validity of the Thorndike Intelligence Examination 
and other Columbia tests. All such methods might be divided into two 
categories: “curricular” (analysis of courses of study, textbooks, examina- 
tion questions, and the like; judgments of experts; analysis of errors; 
social utility; etc.) and “statistical” (correlations against independent 
criteria; increases in percents of successes in successive grades or ages, etc.). 

Item validation—The fundamental method of selecting items (otherwise 
adjudged valid) in educational tests is that of determining the percent of 
successes in successive age or grade groups. Chapman (91) presented a 
variant procedure in trade testing where the subjects were classified as ex- 
perts, journeymen, apprentices, and novices. In tests for use in single 
grades, as in many high-school subjects, some variation of the method of 
correlation of single items against total scores is used (91, 93, 155, 207). 
Clark (93) presented a formula for the validity of test items which was 
attacked by Peatman (185) as yielding too small returns for labor in- 
volved. Whelden and Davies (256) divided the test group into three sub- 
groups by relative scores on an outside criterion test as a method of select- 
ing the most functional items. Lentz, Hirshstein, and Finch (155) com- 
pared four methods of selecting items, namely, percents of successes by 
highest and lowest thirds of group, Vincent’s overlapping method (250), 
McCall’s method (164), and Lentz’s method of the summation of agree- 
ments (156). The highest- and lowest-third method proved best in general. 
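
The highest- and lowest-third comparison can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Lentz, Hirshstein, and Finch: pupils are ranked on total score, the item is scored 1 or 0, ties in total score are left in their original order, and the index reported is simply the difference in percent passing between the two extreme thirds. The function name is hypothetical.

```python
def upper_lower_third_difference(item_scores, total_scores):
    """Percent passing the item in the top third minus percent passing in the bottom third."""
    n = len(total_scores)
    if len(item_scores) != n or n < 3:
        raise ValueError("need equal-length sequences with at least three pupils")
    # Rank pupils by total score; sorted() is stable, so ties keep their input order.
    order = sorted(range(n), key=lambda i: total_scores[i])
    k = n // 3
    low, high = order[:k], order[-k:]

    def percent_passing(indices):
        return 100.0 * sum(item_scores[i] for i in indices) / len(indices)

    return percent_passing(high) - percent_passing(low)
```

An item passed about equally often by both groups yields a difference near zero and would be a candidate for rejection under this method.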

Smith (218) investigated the validity of judgments of difficulty of test 
items, as an initial stage in test construction, by experienced teachers, in- 
experienced teachers, and test experts. The judgments of experienced 
teachers proved most valid, with the test experts second, and the inex- 
perienced teachers third; the average validity coefficients being .86, .80, 
and .76, respectively. 

Validity as affected by specific determiners—Weidemann (252) demon- 
strated an important source of invalidity of test items through what he 
aptly calls specific determiners. He found that if the words always and 
never occurred in true-false items, such items were false two out of three 
times. Conversely, statements of degree or comparison were true in two- 
thirds of the cases. Brinkmeier and Ruch (87) analyzed 10,756 true-false 
items and found that 2,018 contained specific determiners and listed 15 
categories of words or phrases which act to determine the truth or falsity 
of true-false items. Brinkmeier (88) showed sentence length to be a specific 
determiner; the longer the sentence, the greater likelihood of truth. Brink- 
meier and Keys (86) described another such factor under the term of 
circumstantiality or general plausibility arising from wealth of detail. 
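
A test maker who wishes to check a set of true-false items for the first kind of specific determiner might proceed along the following lines. The sketch looks only for the two words named above, always and never; the fifteen categories of Brinkmeier and Ruch are not listed in this review and are not reproduced. The function name and the data layout are illustrative.

```python
import re

# Only the two words named in the text; not the full set of categories.
DETERMINERS = ("always", "never")


def false_rate_with_determiners(items):
    """items: iterable of (statement, is_true) pairs.

    Returns (number of items containing a determiner,
             fraction of those items keyed false), or (0, None) if none match.
    """
    pattern = re.compile(r"\b(" + "|".join(DETERMINERS) + r")\b", re.IGNORECASE)
    flagged = [is_true for text, is_true in items if pattern.search(text)]
    if not flagged:
        return 0, None
    return len(flagged), sum(1 for t in flagged if not t) / len(flagged)
```

A false rate well above one-half among the flagged items would suggest that pupils can answer them from the wording alone.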

Mathews’ findings (162, 163) indicated that pupils select the left of two 
alternatives 33.8 percent oftener than the right, and that the upper is 
chosen 3.2 percent more frequently than the lower. Ruch and Meyer (196), 
however, failed to verify such biases. 

Relative validity of different types of items—Workers in the field of new- 
type or objective examinations have attempted to discover whether com- 
pletion, true-false, multiple-choice, and similar tests measure the same 
function. Brinkley (85) found essay and objective tests to be approxi- 
mately equally valid. Ruch and DeGraff (199), assuming simple com- 
pletion items to be valid, concluded that true-false and multiple-response 
tests measured the same functions as did the completion forms. Wood 
(261), using several different criteria, found no differences in the validities 
of different types of items. Eurich (115) reached similar conclusions; and 
the studies of Paterson and Langlie (183) and Ruch and Charles (198) 
are likewise in agreement. Remmers and others (193) found presump- 
tive evidence, however, that certain individuals may do less well on one 
type of test than another. 

Effects of instructions and scoring on validity—Several attempts have 
been made to determine the relative validity of scoring tests involving 
chance successes by the simple number right (R) method and by the cor- 
rection formula, $S = R - \frac{W}{n - 1}$, where S is the score, R the number right, 
W the number wrong, and n the number of choices presented. (For two- 
response tests, for example, true-false, this formula reduces to S = R — 
W.) Ruch and Stoddard (197), Paterson and Langlie (183), Wood (261), 
Ruch and DeGraff (199), and others agree that the use of the formula lowers 
the reliability. Ben D. Wood (261), Ruch and DeGraff (199), and E. P. 
Wood (262) found, on the other hand, that the formula increases the 
validity while affecting the reliability adversely. 
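
The correction formula, written as S = R - W/(n - 1), can be stated in a few lines. The sketch below is a minimal illustration; the function and parameter names are not from the source, and omitted items are assumed to count neither as right nor as wrong.

```python
def corrected_score(right: int, wrong: int, n_choices: int) -> float:
    """Score corrected for chance: S = R - W / (n - 1).

    Omitted items are assumed to be excluded from both right and wrong.
    """
    if n_choices < 2:
        raise ValueError("need at least two choices per item")
    return right - wrong / (n_choices - 1)
```

For a true-false test (n = 2) the divisor is 1 and the score is simply rights minus wrongs; with five choices each wrong answer costs one-fourth of a point.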

Ruch and DeGraff (199) also investigated the effects of instructions (a) 
to guess when in doubt and (b) to omit when in doubt. The latter seemed 
the more valid procedure, especially if the tests were scored by the formula 
for correction for chance. 

Thurstone (238), Brinkley (85), Foster and Ruch (120), and Staffelbach 
(220) considered the ideal weightings to be applied to rights, wrongs, and 
omissions. These results are not altogether in agreement, possibly because 
of differences in the instructions. The evidence suggests that R — W scor- 
ing penalizes slightly for errors, and that omissions are significant in a 
theoretical score. Weidemann (253) also suggested the latter. Holzinger 
(135) published a proof that R and R — W scores correlate to unity 
when there are no omissions. 
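
Holzinger's proof is not reproduced in this review. One short way to see why the result holds is given below; it is offered as an illustration and not as his argument. The symbol N, the number of items in the test, is introduced here for the purpose, and the scores R are assumed to vary from pupil to pupil.

```latex
% With no omissions every item is either right or wrong, so W = N - R.
\[
  R - W \;=\; R - (N - R) \;=\; 2R - N .
\]
% R - W is therefore an increasing linear function of R, and
\[
  r_{R,\,R-W}
  \;=\; \frac{\operatorname{Cov}(R,\; 2R - N)}{\sigma_R \,\sigma_{2R-N}}
  \;=\; \frac{2\,\sigma_R^{2}}{\sigma_R \cdot 2\,\sigma_R}
  \;=\; 1 , \qquad \sigma_R > 0 .
\]
```

Once pupils omit items, W is no longer fixed by R, and the two scorings can rank pupils differently.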

Time limits as a source of invalidity—A moot question has been whether 
time limits invalidate tests by making them measures of speed. May and 
Terman (264) found Army Alpha scores for single and double time cor- 
related .965; Ruch and Koerth (205) obtained .966 in the same situation 
and .945 for regular and unlimited times; and Ruch (206) found even 
higher correspondences for regular and unlimited times for the Stanford 
Achievement Test and the Terman Group Test. These writers concluded 
that the timing of tests is no serious source of invalidity. Brigham (84) 
and Frank N. Freeman (122) drew exactly the opposite conclusion from 
these data, and held that the high correlations indicate that speed is the 
main variable. Frank S. Freeman (123) reported two sets of experiments 
where single and double time correlated .97 for speed and .78 for power. 
Longstaff and Porter (159) supported the former view. Peak and Boring 
(184) held that there is a marked correlation between speed in intelli- 
gence tests and speed in simple reactions. The final test should be whether 
the slow eventually catch up with the rapid. In no case was this found to 
be true. 

Effects of group work on tests—The question whether working alone or 
in groups affects scores has been investigated by Weston and English (255) 
and Farnsworth (116) with conflicting results. The latter found mean 
scores to be no higher in group testing than when individuals were tested 
alone. 


Measures of Reliability of Tests 


Concept of reliability—Spearman (219), Brown (90), and Kelley (148, 
151, 152) developed the concept of test reliability and formulae for ex- 
pressing errors in test scores. More general discussions were given by Sy- 
monds (225, 226), Ruch (204, 207), Mangold (160), and Foran (119). 
Symonds (225) listed twenty-five factors affecting reliability. Criticisms 
of reliability concepts and practices were brought forward by Crum (97), 
Lincoln (157), and Muenzinger (178). 

Optimum reliability of test items—Otis in 1916 showed that the most 
reliable test was one on which the average score is 50 percent of the maxi- 
mum score. Symonds (224) proved the same point and laid down six spe- 
cific principles of optimum difficulty for maximum reliability. Cleeton (94) 
and T. Thurstone (241) discussed the same issue. Bliss (82, 83) objected to 
the use of percents of successes in selecting and arranging test items. 

The Spearman-Brown Formula—Extended controversy has raged over 
the predictive accuracy of the Spearman-Brown Formula in estimating 
reliability of lengthened tests. Holzinger (134), Holzinger and Clayton 
(133), and Douglass and Cozens (105) reported over-prediction. On the 
contrary, Kelley (143), Gordon (127), Ruch, Ackerson, and Jackson (201), 
Wood (261), Remmers and others (191, 192), Lanier (154), Farnsworth 
(117), and Smith (217) found very close agreements of actual and pre- 
dicted values in such different situations as discrimination of lifted weights, 
spelling tests, measures of musical talent, students’ and teachers’ judgments, 
etc. Slocombe (215, 216) and Thurstone (236) discussed the same formula 
critically. Shen (213, 214) derived and defended a formula for the stand- 
ard error of predicted coefficients, which had been questioned by Holzinger 
and Clayton (133). 
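
For reference, the Spearman-Brown formula under discussion predicts the reliability of a test lengthened k times from the reliability r of the original test as k r / (1 + (k - 1) r). The sketch below, with function names chosen here for illustration, computes the predicted reliability and, by inverting the same formula, the lengthening needed to reach a desired reliability. It assumes that the added material is comparable to the original, which is the assumption at issue in the controversy.

```python
def spearman_brown(r: float, k: float) -> float:
    """Predicted reliability of a test lengthened k times."""
    return k * r / (1 + (k - 1) * r)


def lengthening_needed(r: float, target: float) -> float:
    """Factor k by which a test of reliability r must be lengthened
    for the predicted reliability to reach `target`."""
    if not (0 < r < 1 and 0 < target < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return target * (1 - r) / (r * (1 - target))


# Doubling a test whose reliability is .50.
doubled = spearman_brown(0.50, 2)

# Lengthening required to raise an illustrative reliability of .69 to .90.
factor = lengthening_needed(0.69, 0.90)
```

The starting values are illustrative only. Whether actual lengthened tests reach the predicted figures is the empirical question on which the studies cited above disagree.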

Reliability and sampling—Talbott and Ruch (228) showed the effects 
upon reliability of intensive and extensive sampling. 


Statistical Treatments of Test Results 


Types of norms—In 1920 the prevailing type of norm was the grade 
median or, more rarely, the grade mean. There has been a gradual shift 
toward age norms and variability units such as T-scores, sigma indexes, 
P. E. values, and percentiles. 

Galton (124) and Woodworth (263) seem to have suggested the essential 
idea of comparable measures such as Kelley’s standard measures (147, 152), 
McCall’s T-scores (166), and Franzen’s sigma indexes (121). Monroe 
(174, 175) has argued for the use of age norms rather than grade averages. 
Numerous writers have discussed the advantages and limitations of 
different types of norms: Kelley (148), McCall (105), Monroe (175), 
Ruch and Stoddard (207), Willson (258), and Lindquist (158). 
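
The variability units mentioned above share one idea: a raw score is expressed by its distance from the group mean in units of the group's spread. The sketch below illustrates this with a standard measure, z = (x - mean) / sigma, and a T-type score rescaled to a mean of 50 and a sigma of 10. The linear rescaling is an assumption made here for illustration; the published procedures may differ in detail, for example in the choice of reference group. The function names and sample scores are likewise hypothetical.

```python
from statistics import mean, pstdev


def standard_measures(scores: list[float]) -> list[float]:
    """Deviation of each score from the group mean, in sigma units."""
    m, s = mean(scores), pstdev(scores)
    if s == 0:
        raise ValueError("scores have no spread")
    return [(x - m) / s for x in scores]


def t_type_scores(scores: list[float]) -> list[float]:
    """Linear rescaling of standard measures to mean 50, sigma 10."""
    return [50 + 10 * z for z in standard_measures(scores)]


raw = [12, 15, 18, 21, 24]   # hypothetical raw scores for one group
z = standard_measures(raw)
t = t_type_scores(raw)
```

Scores expressed in this way are comparable from test to test only insofar as the groups on which the means and sigmas were computed are themselves comparable.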

Scaling of tests—The early development of handwriting, composition, 
and language scales turned attention to the scaling of tests. McCall (165, 
166), Van Wagenen (249), and Trabue (248) published typical procedures. 

In a long series of papers, Thurstone (232, 234, 237, 239, 240) sought to 
present refinements of the earlier methods. His method of scaling sought 
to avoid the difficulties of the P. E. values which arise from unequal varia- 
bilities in successive age or grade groups. He re-scaled the Trabue language 
and Woody arithmetic scales by way of illustration. Holzinger (136) criti- 
cized certain of the assumptions basic to Thurstone’s method and appeared 
to favor the calculation of scale values from a single group of wide age 
range. Courtis’ isochron method (96) based upon the Gompertz equation 
also affords possibilities for scaling tests over wide age intervals. 

Weighting of test items—In the past ten years the practice of weighting 
individual test elements has been subjected to general attack. Douglass and 
Spencer (104) found correlations of .975 to .999 for weighted and unweighted 
scores on four educational tests. As a result Douglass abandoned 
his weightings in revising his algebra tests. Holzinger (131) found unweighted 
scores to correlate .99 with weighted values by two common 
methods. Corroborative evidence was published by West (254), Corey 
(95), Scates and Noffsinger (212), Odell (180), Ruch and Meyer (196), 
Potthoff and Barnett (189), and others. It is probably better to abandon 
weightings in tests as contrasted with scales proper. 

The A. Q. technic—Franzen’s proposal (121) of the accomplishment 
quotient (A. Q.) in 1920 resulted in a flood of literature, at the outset 
favorable, but more recently largely antagonistic. It is to be noted that Mon- 
roe and Buckingham (173) and Pintner and Marshall (187) developed 
similar procedures independently. Difficulties in the A. Q. technic were soon 
pointed out by Toops and Symonds (247), Chapman (92), and Ruch (195), 
although Stebbins and Pechstein (222) defended it as a very valuable 
measure. Thomson and Pintner (230) demonstrated the danger of spurious 
index correlations in such quotients. Perhaps the most damaging evidence 
came from Symonds (223), Popenoe (188), and Odell (179), who showed 
the reliability of A. Q.’s to range from about .20 to .60 for existing mental 
and educational tests. Chapman (92), Herring (130), Huffaker (140), and 
others pointed out the necessarily large probable errors of quotients from 
unreliable measures. Foran (118) noted the typical finding that the relia- 
bilities of such quotients are lower than the scores themselves. Wilson 
(259), Douglass and Huffaker (103), Morley (176), Rand (190), and 
McCrory (167) have all found the A. Q. procedure unsatisfactory. In the 
meantime Kelley (148, 150) developed a technic for comparisons of 
mental and educational ratings which appears to rest upon a sound statistical 
basis. It seems at this date that the A. Q. procedure is more likely to 
be discarded than to be retained. 
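
The low reliabilities reported for the A. Q. follow a general pattern: an index that compares two positively correlated measures tends to be less reliable than either measure alone. As a rough illustration, the sketch below uses the standard formula for the reliability of a difference between two scores of equal variability as a stand-in for the quotient. This substitution is an assumption made here, not the computation used in the studies cited, and the input values are hypothetical.

```python
def difference_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of the difference X - Y for two measures of equal
    variability, given their reliabilities and their intercorrelation."""
    if r_xy >= 1:
        raise ValueError("intercorrelation must be below 1")
    return (0.5 * (r_xx + r_yy) - r_xy) / (1 - r_xy)


# Two tests, each of reliability .90, correlating .75 with each other.
fairly_reliable = difference_reliability(0.90, 0.90, 0.75)

# The same intercorrelation with reliabilities of .85.
less_reliable = difference_reliability(0.85, 0.85, 0.75)
```

Under these assumptions the index is always less reliable than the average of the two measures compared, and the loss grows as the intercorrelation of the measures approaches their reliabilities.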


Time Saving and Cost Reduction Devices 


Correlations by machines—Three devices for the calculation of correla- 
tions by machines have been reported, namely, the Hull (142), Dodd 
(102), and the Mendenhall-Warren-Hollerith methods (168, 251). The 
latter is an adaptation of the usual Hollerith sorting machine and is claimed 
to be the most rapid and economical device yet invented. 

Plotting devices and correlation charts—Symonds (227) brought to- 
gether as a convenient reference fifty-two variations of the Pearson Product- 
Moment Formula. Toops (246), Orleans (181), and Anderson and Toops 
(80) described simple apparatus for tabulating and computing correla- 
tions. Edgerton and Toops (113) simplified the calculation of intercor- 
relations by tables. 

Convenient correlation charts or scatter diagram forms have been placed 
on the market by Holzinger (132), Kelley (145), Otis (182), Ruch and 
Stoddard (200), Ruger (208), Thurstone (233), and Toops (245). 
Dvorak (107, 108) has a chart for the computation of eta from grouped 
data and assumed means. 

Tables, nomographs, abacs, and graphs—Holzinger prepared tables for 
the probable error of the correlation coefficient as found by the product- 
moment method (138) and, more recently, a set of tables for elementary 
statistical work (137). Edgerton and Paterson (114) computed a table for 
the sigma of a percentage. Cureton (100) published two tables to facilitate 
the computation of rho. Edgerton and Toops (113) constructed a table for 
determining increases of validity and reliability for lengthened tests up 
to n = 15. Edgerton (112) has a table for the probable error of R pre- 
dicted by the Spearman-Brown Formula. Masters and Upshall (161) tabled 
the probable errors of certain inter-percentile ranges. 

Dunlap and Kurtz (106) published a valuable handbook for statisticians 
in which appear twenty-eight nomographs, twelve tables, and a collection 
of the more useful formulae. 

Nomographs, abacs, or graphs have been prepared to facilitate computa- 
tions with common formulae by Cureton and Dunlap (98, 99), Griffin 
(128, 129), Edgerton (109, 111), Toops and Edgerton (244), and Rulon 
(210). 

Partial and multiple correlation methods—The increasing use of partial 
and multiple correlation methods in educational and mental analysis and 
prediction has encouraged the search for more economical methods of han- 
dling large numbers of variables. The method of Yule has been further 
simplified, in turn, by Rosenow (194), Kelley (144, 152, 153), Huffaker 
(139), and Bathurst (81). 

Kelley and Salisbury (149, 211) presented an iteration method for suc- 
cessive approximation of regression weights for large numbers of variables. 
Tolley and Ezekiel (242, 243) claimed that the older Doolittle method is 
more economical. Kelley and McNemar (146) made the counter-assertion 
that the iteration method is more economical for ten or more variables. 
Garrett (125) further simplified the Doolittle method and Peters and 
Wykes (186) prepared work sheets for this method. 


Critical Issues 


Critical issues looking toward the improvement of educational measurement 
might be discussed at almost any length. In view of space limitations, 
the reviewer will avail himself again of the opportunity to mention the 
very great significance of Kelley’s Interpretation of Educational Measurements 
(148). Most of the essential issues and “next steps” are ably presented 
therein. The following further comments may supplement Kelley’s 
discussions. 


1. There are in use today at least one thousand different educational and mental 
tests. Convincing critical and statistical data on the validity, reliability, and norms 
of these measures are available in probably less than 10 percent of the cases. The 
publication of such crucial information is an ethical obligation of the test author and 
publisher. Ruch (202) suggested the minimal requirements in such reporting. 


2. In view of the situation just mentioned, there is an urgent need for comparative 
studies of the relative values of existing tests. In most subjects, this need is probably 
more insistent than the production of new tests. Kelley and a group of experts partially 
met the situation (148: 214-348). Gates (126), Monroe (170), Ruch and his 
students (79, 101, 196, 203, 207), Mosher (177), and Broom and others (89) made 
scattered, individual attempts to evaluate tests of reading, history, geography, physics, 
arithmetic, and other subjects of study. The gross unreliability of many published 
norms was strikingly demonstrated by Adams (79), who found that 8 of the best 
known arithmetic tests rated the mean performance of 152 pupils all the way from 
fifth-grade to eleventh-grade achievement, depending upon which test was employed. 


3. The reliabilities of all but a few existing tests are far too low for the measurement 
of individuals, as contrasted with evaluation of groups. Monroe and others (172) 
found an average reliability of .67 for 21 standard tests. Ruch (204: 142-4) computed 
149 such reliability coefficients and found the central value to be .69. In view of 
Kelley’s standards of required reliabilities (148: 28-9), 131 of the 149 tests are serviceable 
only for group measurement; about 10 are adequate as measures of individuals; 
and only 5 or 6 will justify attempts to compare differences in achievement in different 
school subjects (A. Q.’s or more valid technics). These facts should dispose of the 
demand for shorter tests; longer and more reliable ones are indicated. 


4. Mislabelling of tests is the rule rather than the exception in such titles as 
diagnostic and prognostic. Very few diagnostic tests show sufficient reliability of total 
scores for accurate measurement, not to mention the unreliability of the sub-tests 
individually. Few prognosis tests predict better than a correlation of .60, and frequently 
subjectmatter and intelligence test data are already at hand which would enable an 
equally good or a better prediction, if properly weighted and combined. The reviewer has 
considerable unpublished evidence in support of this contention. 


5. There is urgent need for a fact-finding organization which will undertake impartial, 
experimental, and statistical evaluations of tests: validity, reliability, legitimate uses, 
accuracy of norms, and the like. This might lead to the listing of satisfactory tests 
in the various subjectmatter divisions in much the same way that Consumers’ Research, 
Inc. is attempting to furnish reliable information to the average buyer. The reviewer 
has indeed attempted to initiate such a fact-finding project, but without success to date. 
Independent workers in this field are few as yet, the task is tremendous, and to leave 
such determinations to authors of tests and publishers is likely only to continue the 
present chaotic conditions. 


CHAPTER IV 


Recent Developments in Testing for Guidance 


Itemizing the findings of published articles in a given field of research 
may fail to indicate in proper perspective the larger problems that lie be- 
hind the bits of evidence which have found their way into print. In order 
to be able to present a broad view of what is being attempted in guidance, 
personal letters on the subject were solicited from approximately fifty 
American psychologists and educators who were known to be actively inter- 
ested in guidance. It will be understood, therefore, that points of view and 
quotations ascribed in this summary to certain workers without the citation 
of specific reference numbers were taken directly from these personal letters 
to the reviewer. 


Testing for Selection 


The distinction between the guidance of an individual and the selection 
of an employee or student for a particular type of work is gradually being 
recognized. The employment managers of factories, and perhaps the admission 
officers of many of our schools, are interested in selecting only those 
persons who can do the work successfully. This point of view, as Harry D. 
Kitson points out, is not that of guidance: 

In some occupations, with reference to certain specific jobs, some psychological 
tests have been able to select good workers with a satisfactory degree of accuracy. But 
this is not vocational guidance. It is vocational selection. To do vocational guidance 
we should try to help the unsuccessful applicants to find satisfactory vocations. 

Herbert A. Toops states that “the twelve thousand or so freshmen of Ohio 
colleges could be recruited twice over from high-school graduates who 
have at least a fifty-fifty chance of graduating.” It is probable that almost 
any commercial or industrial organization could, especially during the 
present period of economic depression, replace satisfactorily its entire staff 
from the ranks of unemployed persons by using available tests in their 
selection. Stanton (298: 43), in selecting students with tested musical 
ability for musical training, argued that “competent teachers are deserving 
of the best talent we can give them. Why should a school obtain the best of 
teachers and give them any pupils who wish to study, poor talent as well 
as good talent?” Tests and other means of selection for many types of work 
are now available, although relatively little has yet been done with them 
by most schools and industries. 

One good illustration of effective research with regard to selective devices 
is the work with placement tests that has been going forward under Stoddard 
(300, 301) and Seashore (294, 296) at the University of Iowa. Miller (283: 
112) reported that “reading comprehension tests which demand the understanding 
of the application of principles and which are based upon the 
particular subject to be taken constitute the best single measure of aptitude 
employed in these examinations.” Viteles (310: 200-322) presented an ex- 
cellent summary of much of the research that has been done on the selection 
of workers in industry and business. 

In all of this research on selection, it is becoming increasingly evident 
that objective measures should be substituted where possible for the less 
reliable criteria that have commonly been employed. As Richard D. Allen 
says, “Every day brings evidences of the danger of guess work in cases 
where measurement is possible.” Even verified employment records, show- 
ing the number of years an applicant has worked at a given job, are sur- 
prisingly lacking in their validity as measures of his skill or ability to do 
the work. In examining the academic abilities of 282 unemployed persons 
in Minneapolis who had graduated from high school but taken no addi- 
tional training, more than 1 percent of them were found to possess fifth- 
grade ability, 2 percent sixth-grade ability, 3 percent seventh-grade ability, 
6 percent eighth-grade ability, and so on. Equal lack of real validity was 
found in occupational experiences of men who had worked from ten to 
twenty years in the same occupation. Men who have earned their living for 
years as machine-tool operatives sometimes possess less actual skill and less 
actual knowledge of their tasks than youngsters who have had only a few 
days of experience. 

Commenting upon officials who fail to recognize this lack of equality in 
ability among persons who have had equal opportunities to acquire it, 
Remmers (293: 28) remarked that they “are concerned with what goes into 
the process and not with what comes out.” A similar note appears in Link’s 
(282) comment that “the cultural value of education resides not so much 
in the courses chosen, as in what these courses do for the individual.” 
Industry and education are gradually learning that it is much safer to rely 
upon objective measures of present ability than upon mere records of time 
served at a certain type of work. 


The Scope of Guidance 


It is possible that millions of workers are earning a living at tasks to 
which they are less well adapted than they would be to certain other tasks. 
Most of these persons, poorly adjusted from the point of view of individual 
success and happiness, are probably fairly efficient from the point of view 
of business and industry. They may be well selected, but they have not been 
well guided. In guidance, as Viteles (311: 339) remarked, “the point of 
orientation must always be the individual and his future.” 

It will never be possible in a rapidly changing world to have everyone 
working at the particular task which he can do best, but that fact does not 
excuse us from making serious efforts to approach such an ideal. Any 
organization of society which fails to contribute to the personal adjustment 
and happiness of its citizens will gradually disintegrate, but every contribution 
made to the personal satisfaction and contentment of its individual 
members will add strength and life to the organization which makes the 
contribution possible. Guidance aims, therefore, not only to help the in- 
dividual to find his appropriate place in society, but also to strengthen 
society itself by developing greater satisfaction and loyalty in each of its 
members. 

In the practice of guidance one must consider one’s work incomplete until 
the most appropriate adjustments possible have been indicated for each 
individual. Ignorance of some important factor in the nature of the individ- 
ual cannot release one from one’s responsibility for making an incorrect 
diagnosis. It is obvious that no adviser can ever know all about every trait 
possessed by those individuals to whom he attempts to give guidance, but, 
here again, the inability to attain perfection should not prevent us from 
attempting to approach it as closely as possible. Every aid that science can 
give to an individual in seeing himself and his own characteristics more 
clearly, and in understanding more fully the place in which his talents 
would be of greatest value to himself and to society, is certainly worth the 
effort involved in providing it. 

From this point of view, guidance and education are very intimately 
related. As a matter of fact, educational guidance, social guidance, emotional 
guidance, vocational guidance, and all other desirable types of guidance 
are merely different phases of a single program whose purpose is to build 
the happiest and most fully integrated personality possible upon the founda- 
tion with which nature and previous experience have provided the individ- 
ual. The principles of guidance are the same in all fields. While occupa- 
tional guidance is most often discussed, it is only one phase of the total 
process, and it should not be viewed as an independent task. Occupational 
guidance may be used, however, to illustrate the problems and procedures 
that characterize the entire field. 

Success and personal satisfaction in a given type of work involve the 
presence of a distinctive combination of abilities, interests, preferences, 
and other personal characteristics. While the possession of a few special 
traits may be indispensable in the occupation, there are other traits which 
add materially to the general excellence of the individual’s adjustment. 
Definite knowledge of the distinctive patterns of traits which characterize 
the successful workers in each occupation is essential in one who would give 
occupational guidance. 

The systematic determination of these patterns in an objective manner 
has made relatively little progress. The work of the Employment Stabiliza- 
tion Research Institute of the University of Minnesota (280, 288, 308) has 
indicated one of the methods by which such determinations might be made. 
Strong (302, 303) at Stanford University and other workers (268, 269, 
306) have been determining the interest patterns of successful persons in 
various professions, but there is just as great a need for reliably determined 
patterns of ability, patterns of physical condition, patterns of attitude, and 
patterns of social behavior. Until these distinctive patterns are made avail- 
able, much of the vocational guidance offered will continue to be in the 
nature of “the blind leading the blind.” 

In addition to knowing the distinctive patterns of characteristics pos- 
sessed by successful and well-adjusted workers in various occupations, the 
person who attempts to give vocational guidance should also know how 
these well-adjusted workers came to have these patterns. How early in life 
do various features of these patterns appear? To what extent are the char- 
acteristic traits of successful workers developed by training and experience, 
and to what extent are they the result of original nature? The answers to 
these questions must be sought by means of cumulative records for individuals. 
Objective records of test results during the elementary-school period, 
during the high-school period, and during late adolescence must be studied 
in relation to equally reliable records of the training and experience re- 
ceived at different periods by these same individuals. Analyses of such 
cumulative records will show how early in life mechanical abilities, clerical 
aptitudes, artistic appreciations, submissive personalities, and selling inter- 
ests crystallize sufficiently to be truly indicative of their ultimate develop- 
ment. At the present time we have very little reliable information regarding 
these critical issues. 

This discussion calls attention rather sharply to the fact that very little 
evidence has yet been found to indicate that educational tests have any great 
value in vocational guidance. It is true, of course, that one’s scores in 
objective educational tests have recognized usefulness in predicting one’s 
probable success in school and college courses, but these values are only 
indirectly concerned with vocational guidance. Consistently low scores in 
educational tests would generally be accepted as indicative of probable 
failure in law, engineering, and other professions; but very little is yet 
known regarding the specific patterns of educational test scores character- 
istic of those persons who later become successful in various occupations. 

Here again, a broad and fruitful field of research awaits the investigators 
who can identify a sufficient number of successful, well-adjusted workers 
in different occupations and discover from their cumulative school records 
the test-score patterns which characterized them in early life. If the records 
are sufficiently full to show the specific educational and occupational ex- 
periences which these persons met between the time they took the tests and 
the time they achieved success and happiness in their occupations, significant 
additions will be made to our knowledge of desirable educational pro- 
cedures as well as to our technics of guidance. 


Forward Steps in Guidance 


The demonstration in the World War of the usefulness of so-called “gen- 
eral intelligence tests” (290, 315) led to extensive experimentation with 
mental tests in vocational and educational selection. The limitations of such 
measures are beginning to be determined, and the tendency at present is to 
use in their stead measures of specific types of ability. Stoddard (301: 23) 
reported that “a number of placement examinations lead to a profile of 
one’s mental-educational skills, which in the case of adults is more intelligible 
and more significant than a single measure, such as I.Q.,” and (301: 92) 
that “partial and multiple correlations demonstrated the superiority 
of placement examinations over high-school achievement and the traditional 
intelligence test as a device for predicting college success in a subject.” 
Comparable results have been obtained by other investigators of educa- 
tional selection (276, 293, 298) and by practically all those who have 
investigated problems of vocational selection (273, 308, 311). 

The large number of different traits that characterize successful workers 
in each different occupation has made it evident that vocational guidance 
cannot confidently be offered to an individual without a very wide sampling 
of his characteristics. The Employment Stabilization Research Institute of 
the University of Minnesota (280, 288) in their examinations of approxi- 
mately four thousand unemployed persons used a uniform program of 
individual tests and examining schedules requiring of each individual 
approximately six hours, supplemented by such special examinations and 
follow-up studies as were found necessary. The Minnesota examinations 
included personal, social, educational, and occupational histories, carefully 
checked by well-trained industrial-social case workers and cleared through 
the confidential exchanges of the community social agencies; complete 
health and physical examinations, supplemented by routine fluoroscopic 
and bio-chemical tests; measures of physical strength, visual and auditory 
acuity, color blindness, and the like; tests of academic achievement, clerical 
aptitudes, mechanical abilities, dexterity in the use of hands, fingers, and 
small tools; and measures of interest patterns, likes, dislikes, and person- 
ality traits. Trade tests of skill, knowledge, and appreciation were also used 
freely. 

Link (282) reported that Hanna found “from five to fifteen hours time 
essential to an adequate vocational analysis,” and that Viteles found “five 
hours the minimum time necessary in the simpler cases.” Viteles himself 
(311: 335) stated that “underlying the work of this guidance clinic is the 
point of view that adequate guidance involves a consideration of all the 
psychological, social, economic, and physical factors which may affect the 
progress of an individual in a vocation.” 

Interpretation of all these data regarding an individual, even if one were 
adequately supplied with complete information regarding the distinctive 
patterns of traits characteristic of each possible occupation, would be a task 
calling for the utmost of mature judgment and sagacity. Heller (272: 437) 
in reporting on the best practice of industrial psychology in Switzerland, 
said, “The psychological diagnosis attempts to survey the total personality 
of the individual.” Viteles (311: 338) declared that “the problem of guidance 
is that of weighing every element in the situation—of establishing 
balance between the highly complex interrelated variables which influence 
individual adjustment.” The seriousness of this responsibility leads R. S. 
Uhrbrock to ask, “By what right does a vocational counselor, for the most 
part unversed in clinical methods or in statistical technics, administer tests 
and then undertake to interpret the results and give vocational advice?” 
Brotemarkle (267: 258) also believed that “psychometricians are not adequately 
trained to give the analytical diagnosis basic to the solution of 
individual problems.” 

Definite progress is being made in determining the statistical reliability 
and validity of various specific measures. There would be a certain amount 
of advantage scientifically if those who devise tests could register them and 
have them assigned certain numbers rather than names. O’Connor (286), 
for example, preferred to discuss “Work-sample No. 16” rather than the 
“Finger Dexterity Test.” Any name assigned to a test leads those who hear 
it to expect from the test other types of information than it can possibly 
provide. The painstaking use of correlations (273, 287, 300), tetrad differences 
(297), path coefficients (314), and iteration methods (277) in teasing 
out the real meanings of test scores will ultimately be of great value in 
guidance, since it will make clear to us the degree to which different tests 
are measuring the same traits. 

The development of tests for other traits than achievements and abilities 
is making relatively slow progress. Hartshorne and May (271) made sig- 
nificant progress in measuring social and ethical behavior, and Thurstone 
(307) did valuable work in measuring attitudes. Strong’s (303) testing 
of vocational interests was a contribution of marked importance. Mary H. S. 
Hayes, for example, writes that in her judgment “Strong’s interest tests 
come nearest to being the most significant development, in spite of the fact 
that they are not applicable to the younger group.” 

There is great need, however, for more objective tests of determination 
to succeed, willingness to sacrifice immediate comfort for ultimate success, 
physical attractiveness, and other traits that are not closely related to mere 
abilities. Daniel Starch writes, for example, that “achievement in occupa- 
tions depends to a larger extent upon such qualities as initiative, aggressive- 
ness, and industry than is commonly realized. Tests to date have not ap- 
parently measured these qualities with sufficient reliability.” 


Other Critical Issues 


One of the difficult tasks which is slowing up the progress of testing for 
guidance is that of identifying well-adjusted individuals in each occupation. 
The so-called “democratic organization” of society in America tends to 
disturb even those persons who are quite happy in their work, since some
of them are subject constantly to the temptation of wishing for social prestige,
greater political power, or larger financial return, even though they
may not be at all typical in their trait patterns of the persons who are well 
adjusted in these other positions which they covet. In setting up standards, 
we must be very careful to select as representative of each occupation, 
therefore, only those who are clearly adapted to their positions. 

Another difficulty lies in the lack of adequate organization for obtaining 
the patterns of traits typical of well-adjusted persons in the various walks 
of life. An individual research worker can do relatively little by himself, 
and what he does is in danger of being only local in its significance. A 
national clearing-house and research center with adequate funds for the 
nation-wide determination of occupational patterns, each with its local 
variants, is very greatly needed. Such an organization would also be of 
tremendous public service, in revising our scheme of occupational classi- 
fication, which years of social and technological changes have rendered 
quite obsolete. 

One of the problems on which further experimentation is necessary con- 
cerns the methods one may use to represent the patterns or combinations of 
traits in an individual or in an occupational group. A graphic profile has 
been used (308), and Hull (273) has proposed a differential index to be 
derived statistically by multiple correlation technics. 

Whatever the method used in recognizing the individual’s pattern of traits, 
psychological insight into the individual’s whole personality, and a full
understanding of the working conditions in the various occupations open
to him, should be possessed by his occupational adviser. This means, among 
other things, that the vocational adviser must be selected by means of the 
most rigid selective devices available and then trained, not only in the 
clinical methods of vocational psychology, but also in first-hand contacts 
with the actual conditions in business and industry. As R. S. Uhrbrock 
says, “Any movements that provide first-hand vocational experience for 
counselors are worth fostering.” 

Still another problem that must be attacked cautiously is that of de- 
veloping public confidence in the value of information obtained by these 
technics. It is easy to claim too much validity for a diagnosis; yet millions 
of dollars are being thrown away annually by persons who are consulting 
palm readers, phrenologists, graphologists, and physiognomists in serious 
efforts to obtain better adjustments to their occupations. Objective tests, if 
properly interpreted in the light of scientific studies, provide the soundest 
basis for really valid guidance, but the results of two or three tests on a 
given individual should not ordinarily be considered adequate. There is no 
short-cut to a valid occupational diagnosis. 

Vocational guidance should never become vocational control, and yet 
some effective way must be discovered to persuade unadjusted persons to 
consider seriously the results of suggestions growing out of careful occupational
diagnoses. Traditions and beliefs regarding the greater social
dignity of certain occupations are hard to overcome. Harry J. Baker says, 
“It seems to me that one of the most crucial problems in guidance is ways 
and means of changing and molding the vocational aspirations of dull and 
average individuals into lines of activity within their abilities and to do
this without undue shock and with a positive rather than a negative psycho- 
logical approach.” 

The reviewer has felt for many years that one of the first steps in breaking
down popular prejudices against certain kinds of work is to abandon
the use, in discussing workers and occupations, of such adjectives as dull, 
inferior, superior, intelligent, and the like. From a social point of view the 
garbage collector who likes his work, and who has all the appropriate 
characteristics for that particular job, is a much better citizen than the 
man who, with the characteristic trait patterns of a mule driver, is trying to 
manage an employment office or banking business. Perhaps it is impossible 
in a capitalistic economic system to substitute the concept of adequacy of 
personal adjustment for that of adequacy of financial standing as a basis 
for social prestige, but we should certainly do everything possible to remove 
artificial stigmas of every kind from all types of useful service. 





CHAPTER V 


Recent Developments in the Uses of Tests 


SCOPE OF TESTING MOVEMENT 


The scope of measurement in education is probably not yet generally
appreciated. When one considers the kind and number of research studies 
in education since the advent of the testing movement in contrast with those 
of the previous two or ten decades, the realization of the potency of this 
new instrument is inescapable. Testing procedures are now a matter of 
course in the attack on educational problems everywhere. Twenty years ago 
tests were novelties—technics of investigation consisted largely of the com- 
pilation of opinions. Today the use of educational tests has become almost 
as commonplace as that of textbooks. In the more progressive school sys- 
tems, teachers utilize various forms of educational tests continuously. 
Thousands of students of education are making use of this relatively new
device. The ultimate purpose may, in general, be said to be the improve- 
ment of instruction. This is often sought indirectly through changes in 
administration, in organization of schools, in classification of pupils, in 
educational and vocational guidance, and so on; but as a rule some form 
of measurement constitutes the basis. Standardized and unstandardized edu- 
cational tests have thus, in large measure, become everyday working tools 
for teacher, principal, and superintendent. Many investigations, of course, 
are never published, but a considerable number of bibliographies have 
already been compiled. A list published by Monroe and his associates 
(396) in 1928 contained 3,650 references to educational researches. A 
large number of these involved the use of tests. The Education Index for 
January, 1929, to June, 1932, (349) alone listed 139 articles under “Tests 
and Scales,” 7 under “Achievement Quotients,” 36 under “Achievement 
Tests,” and 55 under “Educational Measurements.” Many of these references 
are themselves compilations of articles on the same topic. The United States 
Office of Education has published annually since 1926-27 a Bibliography
of Research Studies in Education (451, 452, 453, 454) which includes hundreds
of studies utilizing tests as the essential instruments of research.
Two hundred and fourteen studies out of a total of 4,651 for 1929-30 were 
classified under “Testing and Research” (454). A large number of other 
studies listed in this single bibliography made use of tests. The University 
of Illinois has also published an index of indexes (394), which includes a 
large number of studies based on educational tests, and an annotated bibli- 
ography of graduate theses in education (364). The Teachers College Record 
for January, 1932, (464) contained a bibliography on sources useful in 
determining research completed or under way, including the studies of 
the National Education Association; the United States Office of Education;
the sources for locating research undertaken by individual institutions, such
as the larger universities; theses and dissertations; and the abstracts found 
in the past issues of the Review of Educational Research. Such summaries 
of research in special subjects as those prepared by Gray (361, 362), Bus- 
well (331, 332), Monroe and Engelhart (392), and Lyman (384) likewise 
presented annotated bibliographies describing research studies in which 
large use was made of educational tests. 

This wide range of references is evidence of the growth of the testing 
movement throughout the United States in little more than a decade. So 
vast a literature cannot be reviewed in the brief space here available. Under 
the circumstances it seems a more practical procedure to cite typical con- 
tributions without any attempt to be exhaustive. Probably as much as a 
“five foot shelf” of texts specifically addressed to the uses of tests has 
appeared since the beginning of the test movement. Among the chief recent 
contributors may be mentioned Monroe (395), McCall (388), Trabue 
(445), Courtis (341), Gilliland, Jordan, and Freeman (359), Pressey and 
Pressey (414), Ruch and Stoddard (425), Wilson and Hoke (463), Smith
and Wright (430), Orleans and Sealy (407), Odell (406), Greene and 
Jorgensen (363), Ruch (423), Kelley (378), Hull (372), Van Wagenen 
(455), Hildreth (368), Madsen (386), Russell (426), Michell (390), and 
Tiegs and Crawford (443). 

Generally speaking, these authors have addressed themselves to the 
use of tests. Achievement tests have received considerable attention. The 
Measurement and Adjustment Series of texts edited by Terman, which in- 
cludes some of the authors already mentioned and in addition such writers 
as Dickson, Fenton, Goodenough, Otis, Stidham, Wells, and Wood, deals 
with the general problem of pupil testing and adjustment. A closely related 
series of statistical texts also dealing especially with educational measure- 
ments and their interpretations has appeared. Among these may be men- 
tioned books by Otis (411), Holzinger (371), Garrett (354), Lincoln (383), 
Thurstone (442), Macdonald (385), Kelley (378), Dunlap and Kurtz 
(346), Odell (406), and others. All this literature has appeared subse- 
quent to such basic studies as those by Galton (353), Cattell (334), Thorn- 
dike (441), and Terman (438, 439, 440). 


TYPES OF USES OF TESTS 


A mere listing of ways in which educational tests affect educational theory 
and practice serves to emphasize the recent wide influence of this relatively 
new technic. Among such uses may be mentioned the following: 


(1) Determining and evaluating administrative policies, including the classification 
of pupils, provision for individual differences, standardization of teachers’ marks, 
curriculum construction, and supervisory activities. 

(2) Setting up objectives and evaluating the products of the educational program. 

(3) Evaluating methods of teaching. 

(4) Improving learning through a discovery of learning difficulty, the sources of 
motivation, and the uses of self-teaching test materials.


1. Determining and Evaluating Administrative Policies


Educational tests, as a measure of the effectiveness of administrative 
policies, got under way in the surveys made either by local school boards, 
by special commissions, or by bureaus of research. In the 194 city school 
surveys made since 1910 and reported by Caswell (333), more than half 
involve the use of tests as basic data for evaluating the school system. In 
the Baltimore survey of 1921, for example, the status of pupils revealed by 
standardized arithmetic tests called attention to the fact that too much time 
was being given to this subject and too little to other subjects. The recom- 
mendations of the surveyors based on test data led to a change in the ad- 
ministrative policy in this respect. Local surveys, such as the semi-annual 
instructional survey conducted by the Baltimore Bureau of Research (433), 
have influenced practically every phase of the school system. This is like- 
wise true in Providence, Philadelphia, Cleveland, Detroit, and other cities. 

Classification of pupils—It was inevitable that the possibility of more 
accurate measurement of the educational product of the classroom should 
lead to frequent uses of educational aptitude tests in the sorting and re- 
sorting of pupils within schools and within classes. While intelligence tests 
have from the beginning served as the more common means for so-called 
homogeneous grouping, the many controversial issues involved have served 
to deflect more and more interest toward achievement tests whose signifi- 
cance is at least thought by most persons to be less uncertain. Grouping 
pupils in and within classes on the basis of achievement scores in successive 
subjects has become a typical use of educational test data. The reports of 
555 superintendents to the National Education Association (402) indicate 
how frequently this practice is followed. A detailed statement as to how 
three cities use this procedure is shown by Baltimore, Colorado Springs, 
and New York City. Chism (335) reported that 67.5 percent of the ele- 
mentary schools in 490 cities with a population of 2,500 to 100,000 use 
standardized educational tests as a basis of classification. 

As this practice developed, questions arose as to the relative merits of 
several methods of classification, and again educational tests became effec- 
tive means of aiding the evaluation. Burr’s examination (329) of the edu- 
cational achievements of homogeneous groups with emphasis on variability 
as a supplement to central tendency, concluded that there was great over- 
lapping of achievements of groups as sectioned in the six cities studied. 
Purdom (416) found that first year high-school pupils in homogeneous sections
do not gain more in English and algebra than pupils in heterogeneous sections
when the results are measured by standardized tests. Keliher’s investigation (376)
called attention to the effects of homogeneous grouping not measured by
educational tests. Hollingshead (369) evaluated the use of certain educa- 
tional tests and mental measurements for purposes of classification. 


As the problems of classification loom up, prognosis becomes an end 
for using tests. Courtis (341) studied how reliably the success of a child can
be predicted from the measurement of a few basic factors with existing
tests and scales. He reported that boys within the age range and school con- 
ditions studied have been proved to succeed in their school work in differ- 
ing degrees primarily because of differences in the maturity or development 
factor best represented by age. The comment is also noteworthy that approximately
7 percent of the present errors of prediction of Stanford scores have
been proved to be caused by imperfections in the measuring instruments 
themselves. In checking the relative values of individualized versus group 
instruction when the pupils under both plans go on to higher schools, Wash- 
burne, Vogel, and Gray (459) found the results indicate that the mastery 
of the fundamental facts in arithmetic, reading, and language as measured 
by standardized tests is facilitated somewhat for most pupils by the Win- 
netka technic. One classification problem, namely, class size, has been 
thoroughly investigated by Smith (429). Her contribution is significant, 
not only in its resulting information concerning the conditions under which 
large classes can be effectively handled, but also in her skillful use of tests. 
In addition to the series of tests incident to the pairing of the groups, a pro- 
gram of achievement testing was carried on throughout the year. Pre-tests 
were used in many units of work, and quarterly tests in the common English 
skills. Where standardized tests were available, they were used in alternating 
forms at the beginning and at the end of the year and, in some cases, at the 
end of each quarter. Where standardized tests were not available, objective 
tests based upon the particular content of the course were devised.

Provision for individual differences—Growing out of classification studies
comes the realization of wide-spread individual differences. In the pro- 
cedures used for providing for these differences, educational tests function, 
not only to discover differences and to evaluate procedures, but also as an 
integral part of the teaching technics. The National Society for the Study of 
Education (403:xii), for example, called attention to the fact that complete
diagnostic tests need to be prepared on each unit of achievement, and that 
self-instructive and self-corrective practice materials are to be prepared to 
enable children to get ready for the tests individually or to repair de- 
ficiencies shown by the tests. Buckingham (327) set forth a program of 
individualized instruction on the basis of testing. Washburne (456) sup- 
ported his philosophy of individualized instruction with test data showing 
that high-school pupils from Winnetka’s individualized schools when com- 
pared with pupils from other localities are above the average of all other 
pupils in four classes. Using educational tests as one basis in equating 
groups and as a measure of growth, Broening (325) made an experimentally 
determined study, which revealed that individualized technics based on initial
test data and self-corrective test materials brought about greater
achievement among junior high-school pupils in geography. Elective courses 
and preventive classes utilizing educational test data, either to select individuals
or to measure achievement, have also been developed to provide for
individual differences (319).

Remedial teaching—An essential step in diagnostic or remedial teaching
is a skillful use of educational tests. Sangren (427) gave a thorough-going 
report on the use of tests in the improvement of reading. Gates (355, 356, 
357, 358) gave a detailed account of a system of measuring achievements, 
diagnosing difficulties, and conducting test-determined instruction in read- 
ing. Objective test data are presented indicating the value of the procedures 
set up. Similar work was conducted by Courtis and Clapp in arithmetic. 
Metcalf (389) emphasized the need for using scientific testing in the analysis 
and classification of pupils in an effort at correct educational guidance. 
Dvorak and English (347), in measuring the efficiency of remedial teach- 
ing based on the Stanford Achievement Tests, found that their program of 
test-determined teaching produced from .99 to 2.3 years more than the ex- 
pected or regular .5 year growth; and claimed that the reason the pupils 
in their school had been so retarded was that though the teachers knew in a 
general way that the pupils were handicapped, they lacked the scientific 
and quantitative information which only the standardized survey and diag- 
nostic tests can give. Monroe (391) made use of tests in developing meth- 
ods of diagnosis and treatment of cases of reading disability. 

Promotion of pupils—Ample evidence shows that the technic of advanc- 
ing pupils to a higher grade is still far from objective. One of the yet un- 
solved problems is the weight which should be attached to the results of 
objective tests where available. That achievement tests, however, should play 
a part in determining promotion seems obvious, but there has been, appar- 
ently, little progress as yet in the field of test-determined standards of 
promotion. The beginning made in Baltimore, reported by Kramer (381), 
Douglass (345), and Frazee (352), indicates how, as a result of the city- 
wide programs of testing in arithmetic and reading, standards of attain- 
ment have been set up for the skill subjects in the elementary grades and 
in some subjects on the secondary level. These test data assist teachers in 
determining which pupils have made sufficient achievement in the skills 
indicated to warrant their promotion at least in the subject concerned. By 
allowing a deviation below the standards by as much as a half grade, the 
danger of injustice is removed, and at the same time the evils of courtesy 
promotions and low standards on the part of individual teachers are elimi- 
nated. The plan has further administrative advantages in that it fixes re- 
sponsibility directly on principal, supervisor, and teacher for the scholastic 
standing of their school. For those pupils to whom the regular offerings 
are unsuited, a special program suited to their ages and abilities is indi- 
cated, and the administration of this program is greatly facilitated by the 
application of objective standards. Although the operation of the system 
of standards here described is far from automatic and objective, it is well 
removed from the highly subjective conditions which obtained twenty 
years ago. 
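
As a rough illustration only, the half-grade allowance described above can be written as a simple comparison. The function name, the grade-equivalent scale, and the sample scores are assumptions made for this sketch and do not come from the Baltimore reports, which present the standards as an aid to teacher judgment and not as an automatic rule.

```python
def within_promotion_standard(grade_equivalent, standard, allowance=0.5):
    """Return True when a pupil's test score, expressed as a grade
    equivalent, falls no more than `allowance` grades below the standard.

    All names and the default allowance of half a grade are illustrative
    assumptions based on the description in the text.
    """
    return grade_equivalent >= standard - allowance


# Hypothetical pupils measured against a hypothetical standard of grade 5.0.
for score in (5.2, 4.6, 4.4):
    print(score, within_promotion_standard(score, 5.0))
```

A result of False in such a check would only flag the pupil for closer study, since the text notes that the operation of the standards is far from automatic.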





The Ninth Yearbook of the Department of Superintendence offers (402: 
56-64) some excellent suggestions concerning means by which the class- 
room teacher may reduce failure. The three items of greatest frequency 
directly concern the use of tests: 


1. Use achievement and diagnostic tests followed up by special help and remedial 
work—test for deficiencies and diagnose pupil difficulties in each subject. 

2. Give individual attention to pupil needs and interests. This directly implies some 
objective measuring instrument to determine needs. 

3. Group according to ability, differentiate courses of study, and apply teaching 
methods suitable to each ability. 


In response to the inquiry, “What means do supervisors find most success- 
ful in bringing about a wider application of acceptable principles relative 
to pupil promotion?”, fifty-five selected supervisors who were asked to 
contribute to the Ninth Yearbook made most frequent reference to the use 
of standardized tests to supplement teacher judgment. Collier and Miller 
(337) and Tyndall (449) also showed the use of achievement tests in the 
solution of promotion problems. 

Standardization of teachers’ marks—Ruch (423) proposed the use of 
standardized achievement tests to stabilize the marking system of a school
organization. If the normal curve is used in marking, “in the first place we 
should disabuse ourselves of the idea that the normal curve (or any other 
mathematical concept) will tell us exactly how many pupils should receive
A, B, C, etc.” However, in systems using five letters, the approximate distribution,
based upon “the assumption of chance distribution of pupil abilities”
is: A’s, 6 percent; B’s, 25 percent; C’s, 38 percent; D’s, 25 percent;
and E’s, 6 percent. Such a distribution would hold for large numbers of 
pupils, and teachers are justified in objecting to its mechanical application 
to small classes. Marked departures, however, should be based upon demon- 
strated variation from normal conditions. Over a long period, marks should 
approximate the normal distribution and marked variations may be ques- 
tioned. Ruch makes a number of significant proposals in marking, among 
which is this statement: “Give a standard test as a check on grades given.” 
Taylor’s study (437) shows that where objective teacher-made tests are 
used, the teacher’s marks are more reliable than when only essay tests are 
used. 
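
The caution about small classes can be made concrete with a short calculation. The sketch below simply applies the five percentages quoted above to classes of different sizes; the function name and the class sizes are assumptions chosen for illustration and are not part of Ruch's proposal.

```python
# Approximate percentages for a five-letter marking system, as quoted in the text.
MARK_PERCENTAGES = {"A": 6, "B": 25, "C": 38, "D": 25, "E": 6}


def expected_mark_counts(number_of_pupils):
    """Return the number of pupils expected under each mark if the quoted
    percentages were applied mechanically to a group of the given size."""
    return {
        mark: number_of_pupils * percent / 100
        for mark, percent in MARK_PERCENTAGES.items()
    }


# Hypothetical group sizes: one small class and one large school population.
for size in (20, 1000):
    print(size, expected_mark_counts(size))
```

For a class of twenty the expected number of A's is barely more than one pupil, which suggests why mechanical application to small classes is open to objection, while for a thousand pupils the expected counts are whole numbers of useful size.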

Supervisory activities—Present-day supervision in the better school sys- 
tems rests largely on test-determined achievement. The tools of measure- 
ment are continually being utilized as one of the most effective means of 
supervision. Barr (318) brought to light the need of measuring devices 
which will harmonize with the tenets of the new education and gave a help- 
ful discussion of the use of test data in a supervisory program. While many 
inferences drawn from test data by supervisors are probably unscientific, 
it may be said that a definite beginning has been made in this technic. In 
the Baltimore school system, for example, copies of all test results (obtained
regularly at the beginning of each term) are furnished to the supervisory
force. The achievement of each class and of each pupil is scrutinized
by the supervisor before making visits to the school, and recommendations 
for improvement of instruction are based largely on the test results (420). 
Cutright (344) showed how educational tests were used in developing scien- 
tific instructional material, in the determination of an effective teaching 
method, and in equating groups of pupils used in each experiment. Burton 
(330) reported the Detroit study of 1918, in which supervision was evalu- 
ated by measuring children’s achievement through standard tests, Crabbs’s 
1925 investigation in measuring efficiency in supervision and teaching 
through the use of educational tests, and Brueckner’s study of work type 
reading—a test-determined supervisory program. Seaton and Pressey (428) 
indicated how a report of the results of the tests given in October, 1931, 
was prepared in order that teachers and principals may have a guide to 
aid them in improving the work in composition in their schools. Knight 
(379) showed the uses of tests in surveying instruction. Coy (342) re- 
ported a study of the use of the accomplishment quotient in grades 3A to 
6A as a measure of teaching efficiency. Michell (390: 169-175) brought 
out the teaching values in new-type history tests. O'Brien (405) indicated
how tests were made a part of a supervisory program in geography and 
history. Holroyd (370) discussed a supervisory project in educational 
measurement. 
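
For readers unfamiliar with the accomplishment quotient mentioned above, it is conventionally computed as the ratio of educational age to mental age, multiplied by 100. The sketch below assumes that conventional definition; whether Coy (342) used exactly this form is not stated here, and the function name and sample ages are hypothetical.

```python
def accomplishment_quotient(educational_age, mental_age):
    """Return 100 * educational age / mental age.

    Both ages must be in the same unit (for example, months). This follows
    the conventional definition and is an assumption, not a detail taken
    from the study cited in the text.
    """
    if mental_age <= 0:
        raise ValueError("mental_age must be positive")
    return 100 * educational_age / mental_age


# Hypothetical pupil: educational age 108 months, mental age 120 months.
print(accomplishment_quotient(108, 120))
```

A quotient below 100 is read as achievement falling short of what the pupil's measured mental age would lead one to expect.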

Curriculum construction—Perhaps the most interesting recently devel- 
oped use of tests is in connection with curriculum research. This type of 
investigation reaches into the selection of subjectmatter, grade placement 
of units of experience, and methods of organizing the curricular offerings. 
The Sixth Yearbook of the Department of Superintendence (401:325) recognized
the place of educational tests in curriculum research. Burch (328)
utilized silent reading comprehension tests in the determination of the con- 
tent in literature suitable for junior and senior high-school students. Col- 
lings (338) used standardized educational tests as part of his basis for 
equating groups. Standardized tests were also used as a vital part of the 
measurement of outcomes in the experimental and control groups, proving 
the advantages of the project curriculum. Guiler (365) revealed important 
data which are being used by curriculum workers when he analyzed the 
results of the O'Rourke Test given to 240,000 pupils. Harap (367) showed 
a need for formulating tests and practice material as part of curriculum 
programs. Irion (375) indicated a successful attack on curriculum prob- 
lems when he analyzed literary comprehension into five elements, devising 
tests of each element for four different types of literature, using intelligence 
and standardized reading tests for comparison. Washburne (456) reported 
a five year study on adjusting the arithmetic curriculum to the child, “foun- 
dation arithmetic tests” being used in the initial and final measurements 
which determine allocation of arithmetic topics. He also described the use
of standardized reading tests in determining the difficulty of the reading
material to be given the individual child at any particular stage of his 
development. Wood and Freeman (465) used educational and intelligence 
tests in determining the influences of the typewriter in the elementary-school
classroom. Adams (316) used educational tests as a background for try- 
ing out the geography and history in the intermediate grades. Broening 
(323) used cumulative test data and initial and final test scores in the co- 
operative English research studies undertaken in Baltimore. Rankin (418) 
presented survey technics for the experimental determination of the value 
of materials and methods. Rolker (421) studied the spread of ability in 
arithmetic and its relation to standards of promotion and course of study 
revision. Bamesberger (317) compared outcomes of two types of social 
science courses of study through controlled investigations in a limited sit- 
uation. Standardized reading tests were used to equate the two groups 
studied, the outcomes being measured by carefully constructed objective 
tests on the sections of the subjectmatter in question. 


2. Setting Up Objectives and Evaluating the Products of the Edu- 
cational Program 


Eells (350) found in his study of the tests used in seventy-two published 
school surveys that arithmetic and reading are the subjects in which tests 
are most often used. General intelligence, spelling, and penmanship are 
next in frequency, and no other subject is tested in half the surveys. Hence,
over a period of years the availability of standardized tests controlled to
a large extent what objectives of education were objectively measured.
A case in point is the development of the technics of silent reading as a 
major goal of reading instruction. As educators became more critical of the 
direct relationship between what is measured in education and what re- 
ceives emphasis in teaching, steps were taken (1) to prepare standardized 
tests covering more of the objectives set up, and (2) to utilize the objective 
test technic in the measurement of the entire list of acceptable objectives. 
Evidence of the first named use of educational tests, to measure the less 
formal objectives of education, is recorded in the Tenth Yearbook of the 
Department of Superintendence of the National Education Association 
(400) in which a careful evaluation of available tests for character edu- 
cation is made. The June, 1932, issue of the Review of Educational Research 
(461), likewise, presented important research investigations in the field of 
tests of personality and character. Tyler (446, 447, 448) clarified the prob- 
lem of the construction of achievement tests which will really measure the 
entire range of objectives set up in the educational program. Spencer (432) 
reported improvement of teaching by means of home-made, non-standard- 
ized, diagnostic tests and remedial instruction. Ruch (424), Mann (387), 
and Smith (431) contributed studies which reveal the relationship between 
testing and the objectives of education.


3. Evaluating Methods of Teaching


The improvement of educational tests facilitated the careful comparison 
of various methods of teaching and learning. Studies of this character are 
so numerous that only a sampling need be given to indicate how tests are 
used as basic data in equating groups and as measurements of achievement 
brought about by the method under study. Zirbes’ (467) comparative 
studies of current practice in reading showed uses of tests as a means of 
appraising procedures for improving teaching. Broening (324) used tests 
in literary appreciation as a measure of growth due to a method of teach- 
ing literature, as well as reading, intelligence, and appreciation tests for 
equating the groups of pupils used in the experiment. Coryell (339: 4-5, 
25-32, 43) used tests in equating groups and as measures of growth due 
to the use of the experimental factor in her comparison of intensive and 
extensive teaching of literature. Some implications of importance are 
drawn from her experiment regarding the use of objective tests of literature. 
Raguse (417) analyzed quantitative and qualitative achievement in first-
grade reading. Hanna (366) used educational tests in evaluating three 
methods of problem solving. Newlun (404) utilized objective tests in 
measuring the values of ability to summarize and achievement in the social 
studies. Field (351) presented a comparison of reading test scores to prove 
that extensive individual reading and class reading are desirable procedures 
to use in teaching of reading in grades three and four. 


4. Improvement of Learning 


The last ten years have seen an important development in the use of tests 
to discover learning difficulty, sources of motivation, and uses of self- 
teaching tests. This movement has affected the grade placement of items of 
subjectmatter, the definiteness of objectives, and the use by pupils of self- 
appraisal and practice tests. 

Tests to discover learning difficulty—A current type of study directed 
toward improvement of learning is that in which tests are formulated for 
the purpose of investigating the question at issue. Typical examples of 
such studies are to be found in the field of problem solving in arithmetic. 
With contrasting sets of material, Washburne and Morphett (460) showed 
that the children studied succeeded somewhat better when the problems 
involved familiar situations. In an extensive study of 350,000 problem solu- 
tions, Hydle and Clapp (374) examined the effect of eight elements of 
problem difficulty. In like manner, Monroe (393), with materials selected 
to suit the purpose, made careful analysis of the nature of pupils’ mental 
processes in solving problems. From another angle Bowman (321) studied 
the relation between children’s success with materials and the preference 
they expressed for those materials, the r between success and reported 
preference being .56. Wheat (462) measured, with especially selected 
materials, the differences in response to problems of the conventional 
and imaginative type. 
Kramer (380) investigated the effect of four factors, interest, problem form, 
vocabulary, and language details, upon sixth-grade children’s success in 
the solution of the verbal arithmetic problem. Among the studies which 
throw light on the learning difficulty of subjectmatter items are Courtis 
Standard Practice Tests in Arithmetic (340); Osburn’s work in arithmetic 
(408); the Compass Diagnostic Tests in Arithmetic by Ruch, Knight, 
Greene, and Studebaker (422); and Diagnostic Tests in Arithmetic by 
Brueckner and others (326). Washburne (457) described a group of experiments 
to determine the grade placement of several topics in arithmetic; Pressey’s 
test (415) yielded a statistical study of children’s errors in sentence 
structure; Renwick (419) used tests in determining children’s difficulties in the 
study of mensuration; Morton (397, 398, 399) made an analysis of errors 
in the solution of arithmetic problems; Streitz (435), by using tests, dis- 
covered difficulties in arithmetic and their correctives; Brigham (322) 
made a study of error in the college entrance examination; and Goodenough 
(360), through the use of tests, worked out a problem of efficiency in learn- 
ing and the accomplishment ratio. 

Sources of motivation have been studied in a more convincing manner 
through the use of tests. O’Shea’s study (409) of the effect of the interest 
of a passage on learning vocabulary used intelligence, reading, comprehension, 
and vocabulary tests, and also special objective vocabulary tests. She (409: 44, 
49) stated that the significant fact about a pupil’s performance is the 
change in his score from the original test to the retest, and that the 
vocabulary tests are as comparable with the standard tests as the standard tests are 
with each other. Uhl (450) showed the use of standardized materials in 
arithmetic for diagnosing pupils’ methods of work. Symonds (436) used 
the Charters Diagnostic Language Tests in grade six in six New York public 
schools and found that test motivation caused learning over and above 
that which could be explained by practice. The value of the test motivation 
can be estimated as the equivalent of five sheer repetitions. Book and Norvell 
(320) discussed the will to learn—an experimental study of incentives 
which included the use of tests. Curtis and Woods (343) showed the use 
of new-type tests as a teaching device in science and offered a method of 
correction which requires the least of the teacher’s time and energy. 

The Winnetka technic (403) of individualized instruction put renewed 
emphasis on the use of tests as a teaching instrument. Other experimental 
studies of the effect of practice exercises include that by Leonard (382), 
who used objective tests to measure eighth- and ninth-grade pupils’ 
abilities in punctuation and capitalization, and practice exercises as a 
method of improving pupils’ ability to write compositions free from errors. 
His results from the use of practice exercises are statistically convincing 
both in terms of objective proof-reading tests and in composition 
writing; the pupils in the experimental group using the practice tests did 
almost twice as well as the pupils taught by other methods. Ziegler (466) 
has convincing objective evidence that test-determined teaching of liter- 
ature in secondary schools produces greater appreciation of literature as 
well as improvement in silent reading through the use of practice exercises. 


CRITICAL ISSUES 


Consideration of present uses of educational tests suggests certain critical 
issues, some of which are treated elsewhere in this volume, but which are so 
vital that they justify restressing. Foremost among these is the original 
“ubiquitous probable error,” present from the first, but too frequently for- 
gotten in the enthusiasm of novices. No one has set forth the significance 
of this item better than Kelley (378). He rightly pointed out the sharp statis- 
tical distinction in the reliability between group measurements and individ- 
ual measurements. There is a tendency for the average user of tests to for- 
get that whereas medians obtained from the tests utilized may have a satis- 
factory degree of reliability, individual scores in many cases do not. The 
present tendency in the testing movement in public schools is, at least, dis- 
tinctly toward greater and greater analysis of learning difficulties as dis- 
closed by test scores. Moreover, the tendency is not alone toward emphasiz- 
ing individual pupils’ total scores but toward utilizing individual pupils’ 
scores on items in a test. Now it is well known that a pupil’s reaction to a 
single question is highly unreliable. Yet this fact is often overlooked. Thus 
certainty that a pupil actually does not know that 7 + 8 = 15 cannot be 
safely inferred from a single trial, but this is probably frequently being 
done. The need here is for longer and more analytical tests. Some progress 
has been made along this line. Examples are the Compass Arithmetic Tests 
(422) and Osburn’s work in arithmetic (408). 
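
The contrast between group and individual reliability, and the case for longer tests, can be illustrated with a small numerical sketch. It is not drawn from Kelley's treatment: it assumes independent errors of measurement with a common standard error, uses a group mean as a stand-in for the median mentioned above, and borrows the Spearman-Brown formula (not named in this review) to show the effect of lengthening a test. All figures and function names are hypothetical.

```python
import math


def standard_error_of_mean(sem_individual: float, n_pupils: int) -> float:
    """Standard error of a group mean when each pupil's score carries an
    independent measurement error of size sem_individual (an assumption)."""
    return sem_individual / math.sqrt(n_pupils)


def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability of a test lengthened by length_factor,
    assuming the added items are comparable to the original ones."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical figures chosen only for illustration.
sem = 4.0  # error of measurement of one pupil's score, in score points
print(standard_error_of_mean(sem, 1))    # one pupil: the full error remains
print(standard_error_of_mean(sem, 36))   # a class of 36: one sixth of it

single_trial = 0.20  # assumed reliability of a single item such as 7 + 8
print(round(spearman_brown(single_trial, 10), 3))  # ten comparable trials
```

Under these assumptions the error attached to a class average is a fraction of that attached to any one pupil's score, and ten comparable trials of an item are far more dependable than one, which is the sense in which longer and more analytical tests are needed.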

Another critical issue that has been vigorously raised by Kelley is the 
fallacious assumption that intelligence and achievement tests measure 
essentially different attributes. If his contention that the typical intelli- 
gence test measures 90 percent of the identical traits measured by a good 
achievement test is true, it vitiates many of the rather common procedures 
of comparing a pupil’s achievement score with his intelligence score. In 
particular, as he points out, because of the almost total absence of zero 
points in all tests, quotient technics are particularly treacherous. 
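
Why the absence of a true zero point makes quotients treacherous can be seen in a brief sketch. The scores, the size of the origin shift, and the function name are all hypothetical; the point is only that moving an arbitrary origin leaves differences between scores untouched but can change, and even reverse, the order of quotients.

```python
def score_quotient(achievement: float, intelligence: float,
                   origin_shift: float = 0.0) -> float:
    """Ratio of two test scores after moving the (arbitrary) zero point
    of both scales by origin_shift score points."""
    return (achievement + origin_shift) / (intelligence + origin_shift)


# Two hypothetical pupils: (achievement score, intelligence score).
pupil_a = (60.0, 50.0)
pupil_b = (132.0, 115.0)

for shift in (0.0, 100.0):
    qa = score_quotient(*pupil_a, origin_shift=shift)
    qb = score_quotient(*pupil_b, origin_shift=shift)
    leader = "A" if qa > qb else "B"
    print(f"shift={shift:>5}: A={qa:.3f}  B={qb:.3f}  higher quotient: {leader}")
```

With the scales as given, pupil A has the higher quotient; with the same scales re-expressed from an origin 100 points lower, pupil B has. Since no test fixes where zero lies, neither ordering can claim to be the correct one.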

Another critical issue which should challenge persons working in the 
field of educational measurement deals not so much with methods as with 
objectives. Broadly speaking the tools of educational measurement have 
been utilized chiefly in the attempt to improve educational methods. It seems 
to this reviewer, at least, that a far more urgent need at the present juncture 
is for research in the field of objectives or aims in the field of education. 
Tyler (446) did pioneer work in this direction in attempting to enlarge the 
scope of tests in certain fields to measure more adequately the total 
objectives of courses offered than has been done in the past. Unquestionably the 
most urgent problems in education at present are sociological in nature, 
and while the application of measurement technics to sociological problems 
may be vastly more difficult than are the problems of methods of teaching, 
the issue cannot be escaped on this ground. The technics of educational 
measurement should aim to contribute as much to the problem of what 
type of training the present-day school graduate should receive as they 
have in the appraisal of an improvement in the quality of that training. 
This may involve totally new technics and years of effort, but the challenge 
cannot be avoided. 

In the use of tests in typical school systems, an important principle ap- 
parently is that of programming tests in some sort of systematically recur- 
ring schedules as opposed to sporadic testings. Enormously greater gains 
can thus be obtained from test results due to what might be called unearned 
increments. Systematically recurring testing programs make possible greatly 
enriched programs of supervision and instructional research that are not 
possible on the basis of sporadic tests. The work of the Baltimore Bureau 
of Research (433) is an example of this type of procedure. In this school 
system educational tests are given to every pupil in the system at the begin- 
ning of each term in accordance with long range planning. 

Another point, though not strictly a matter of testing but rather a matter 
of principles of education, may be mentioned. It is the levelling process 
which seems too frequently to follow testing programs. Kelley (377) ad- 
mirably set forth arguments against levelling of pupils’ abilities after dis- 
covering individual idiosyncrasies and arguments for the preservation of 
these idiosyncrasies. This reviewer is in agreement with his point of view. 
Individual differences, while so troublesome in neat mechanical schemes of 
school administration, are none the less unquestionably our greatest assets 
and for this reason should be preserved insofar as is practical. 

Another critical issue is the need for mechanization of test procedures to 
reduce the time and labor cost. To make full use of even the tests already 
available requires more time and energy than is generally available under 
present-day school conditions. Since a large part of the work, particularly 
that of scoring and gross tabulating, is essentially mechanical in nature, 
there is great need for mechanical means for carrying out this work so as 
to relieve teachers and test technicians for more interpretative work. It is 
an encouraging sign to note a number of movements in this direction. As 
the demand becomes stronger, unquestionably machinery will be developed 
for doing much of this work. Several examples of efforts along this line 
may be noted. Thus Clapp and Young’s (336) contribution of the carbon 
offset scoring process, and Toops’s (444) envelope scheme for writing all 
answers of a test on a single sheet to economize test booklets tend to reduce 
labor. In the computing field, a number of technics have been reported 
(346, 348, 413) whereby correlations and other statistics may be 
computed by means of standard computing machinery. Hull (373) described 
the most elaborate computing machine for making the almost interminable 
calculations utilized in the partial and multiple correlation technics for 
combining tests according to optimum weights. A considerable number of 
charts, tables, and slide rules have recently appeared. Among these may 
be mentioned Dunlap and Kurtz’s Handbook of Statistical Nomographs 
(346), the Otis Correlation Chart (410), the Universal Percentile Graph 
(412), the Kelley Correlation Chart (377), and the Stenquist Teachers Class 
Analysis Charts (434). 
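
As an illustration of the kind of computation in question, the product-moment correlation can be obtained from five running sums, a form that lends itself to mechanical accumulation. The sketch below is a general one and does not describe any of the technics or machines cited above; the function name and the scores are hypothetical.

```python
import math
from typing import Sequence


def pearson_r_from_sums(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation computed from the running sums of
    X, Y, X squared, Y squared, and XY in a single pass over the scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length, at least 2 pupils")
    n = len(x)
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    for xi, yi in zip(x, y):
        sum_x += xi
        sum_y += yi
        sum_xx += xi * xi
        sum_yy += yi * yi
        sum_xy += xi * yi
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    if denominator == 0:
        raise ValueError("correlation is undefined when either list has no spread")
    return numerator / denominator


# Hypothetical scores of five pupils on two tests.
reading = [12, 15, 9, 20, 14]
arithmetic = [30, 34, 25, 41, 33]
print(round(pearson_r_from_sums(reading, arithmetic), 3))
```

Because only the five totals need be carried forward, each pupil's pair of scores is handled once and then set aside, which is what makes the labor suitable for a machine.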

The heaviest load of the testing program still remains, as before, the 
basic scoring of tests. Unofficial reports indicate that a number of persons 
are at work in the development of machines for doing this work, and it may 
confidently be hoped that in the near future much of this burden may be 
lifted by this means. Until such facilities are provided, the full 
value of tests can never be obtained on any large scale. The diagnostic 
values possible by taking account of individual performances on individual 
items under the proper conditions are at present, for the most part, lost 
because of the huge labor load involved. 